ggml : address integer overflows in binary ops CUDA implementation#24706
ggml : address integer overflows in binary ops CUDA implementation#24706fairydreaming wants to merge 2 commits into
Conversation
ORippler
left a comment
There was a problem hiding this comment.
Could you share the measured performance numbers?
| const uint32_t i10 = fastmodulo(i0, ne10); | ||
|
|
||
| ggml_cuda_pdl_sync(); | ||
| float result = src0_row ? (float) src0_row[i0*s00] : 0.0f; |
There was a problem hiding this comment.
Why do we not have to guard from overflow for i0*s00?
There was a problem hiding this comment.
@ORippler right, this (and other places where two multiplied uint32_t are used as array index) needs a cast too to avoid overflow. Will see how it affects performance.
|
@ORippler performance: Without PR: With PR: Looks like I missed 10% perf degradation for very small tensors caused by size_t() casts, investigating. |
|
@ORippler Added missing size_t casts, now performance is even higher than original: Register usage: unchanged in 74 kernels, lower by 1-2 registers in 15 kernels, higher by 1-4 registers in 64 kernels. Edit: I did a clean recompile of the project and now perf is again a little lower (43.21 GB/s for small matrix, so ~1.2% slower, 2631.88 GB/s for large, so 0.3% slower). Honestly I have no idea what causes these fluctuations. |
The perf benchmarks do not factor out:
which may affect performance and introduce r2r variation. At 1 and 8 us/run this is pretty fast already (typically one cannot go faster than 2-5 us as at these are the minimums pro/epilogue overheads of launching a cuda kernel (constructing & tearing down CTAs etc.) |
|
@ORippler Yeah, I checked clock values of my Max-Q and it was doing pretty wild swings. Locked it to 1410 MHz, increased test time from 1s to 10s and now got this (also added some more tests):
|
Overview
This PR addresses the problem of integer overflows when processing large tensors with GGML binary OPs in CUDA backend. It's not a major overhaul, but rather a least-intrusive way of improving the current implementation.
Fixes #24643
Additional information
The problem was addressed by:
inttype usage touint32_tto double usable integer range,uint32_twon't overflow,uint32_tparameters (and their products inside the kernels) will stay insideuint32_ttype range,i0variable doesn't overflow when incremented in loop iterations,I ran
./bin/test-backend-ops(all tests passed), ran failing tests from #24643 (passed) and compared performance of example small and large tensor f32 addition in old and new implementations (was the samenew impl after adding size_t casts is slightly slower, ~1.2% for small tensors, ~0.3% for large ones).Additionally I compared kernel register usage in old and new implementation.
In most kernels (126) register usage was the same, in 17 kernels it was lower by 1-2 registers and in 10 kernels it was higher by 1-2 registers. With addedsize_tcasts: unchanged in 74 kernels, lower by 1-2 registers in 15 kernels, higher by 1-4 registers in 64 kernels.Requirements