Skip to content

ggml : address integer overflows in binary ops CUDA implementation#24706

Open
fairydreaming wants to merge 2 commits into
ggml-org:masterfrom
fairydreaming:binary-ops-overflows
Open

ggml : address integer overflows in binary ops CUDA implementation#24706
fairydreaming wants to merge 2 commits into
ggml-org:masterfrom
fairydreaming:binary-ops-overflows

Conversation

@fairydreaming

@fairydreaming fairydreaming commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Overview

This PR addresses the problem of integer overflows when processing large tensors with GGML binary OPs in CUDA backend. It's not a major overhaul, but rather a least-intrusive way of improving the current implementation.

Fixes #24643

Additional information

The problem was addressed by:

  • changing int type usage to uint32_t to double usable integer range,
  • adding casts so that multiplication result of two uint32_t won't overflow,
  • using asserts to verify before kernel launches that passed uint32_t parameters (and their products inside the kernels) will stay inside uint32_t type range,
  • making sure that i0 variable doesn't overflow when incremented in loop iterations,

I ran ./bin/test-backend-ops (all tests passed), ran failing tests from #24643 (passed) and compared performance of example small and large tensor f32 addition in old and new implementations (was the same new impl after adding size_t casts is slightly slower, ~1.2% for small tensors, ~0.3% for large ones).

Additionally I compared kernel register usage in old and new implementation. In most kernels (126) register usage was the same, in 17 kernels it was lower by 1-2 registers and in 10 kernels it was higher by 1-2 registers. With added size_t casts: unchanged in 74 kernels, lower by 1-2 registers in 15 kernels, higher by 1-4 registers in 64 kernels.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, discussed with AI possible ways of improving the implementation

@fairydreaming fairydreaming requested a review from a team as a code owner June 16, 2026 20:14
@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning CUDA Related to the CUDA backend labels Jun 16, 2026

@ORippler ORippler left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you share the measured performance numbers?

Comment thread ggml/src/ggml-cuda/binbcast.cu Outdated
const uint32_t i10 = fastmodulo(i0, ne10);

ggml_cuda_pdl_sync();
float result = src0_row ? (float) src0_row[i0*s00] : 0.0f;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not have to guard from overflow for i0*s00?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ORippler right, this (and other places where two multiplied uint32_t are used as array index) needs a cast too to avoid overflow. Will see how it affects performance.

@fairydreaming

Copy link
Copy Markdown
Collaborator Author

@ORippler performance:

Without PR:

$ ./bin/test-backend-ops perf -o "ADD"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0):              966420 runs -     1.04 us/run -       48 kB/run -   44.10 GB/s
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0):            113378 runs -     8.86 us/run -    24576 kB/run - 2645.75 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

With PR:

$ ./bin/test-backend-ops perf -o "ADD"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0):              892710 runs -     1.13 us/run -       48 kB/run -   40.62 GB/s
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0):            113378 runs -     8.84 us/run -    24576 kB/run - 2650.32 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Looks like I missed 10% perf degradation for very small tensors caused by size_t() casts, investigating.

@fairydreaming

fairydreaming commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

@ORippler Added missing size_t casts, now performance is even higher than original:

$ ./bin/test-backend-ops perf -o "ADD"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0):             1007370 runs -     0.99 us/run -       48 kB/run -   46.03 GB/s
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0):            113378 runs -     8.86 us/run -    24576 kB/run - 2645.82 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Register usage: unchanged in 74 kernels, lower by 1-2 registers in 15 kernels, higher by 1-4 registers in 64 kernels.

Edit: I did a clean recompile of the project and now perf is again a little lower (43.21 GB/s for small matrix, so ~1.2% slower, 2631.88 GB/s for large, so 0.3% slower). Honestly I have no idea what causes these fluctuations.

@ORippler

Copy link
Copy Markdown
Collaborator

Edit: I did a clean recompile of the project and now perf is again a little lower (43.21 GB/s for small matrix, so ~1.2% slower, 2631.88 GB/s for large, so 0.3% slower). Honestly I have no idea what causes these fluctuations.

The perf benchmarks do not factor out:

  • Cache hits (cache is not invalidated/flushed)
  • Clocking differences
  • cudaGraph reuse

which may affect performance and introduce r2r variation. At 1 and 8 us/run this is pretty fast already (typically one cannot go faster than 2-5 us as at these are the minimums pro/epilogue overheads of launching a cuda kernel (constructing & tearing down CTAs etc.)

@fairydreaming

Copy link
Copy Markdown
Collaborator Author

@ORippler Yeah, I checked clock values of my Max-Q and it was doing pretty wild swings. Locked it to 1410 MHz, increased test time from 1s to 10s and now got this (also added some more tests):

Test master GB/s this GB/s change
ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0) 28.27 29.55 +4.53%
ADD(type=f32,ne=[4096,256,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0) 1651.68 1638.44 -0.80%
ADD(type=f32,ne=[4096,256,16,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0) 1327.07 1322.73 -0.33%
ADD(type=f32,ne=[4096,256,16,4],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0) 1238.83 1236.64 -0.18%
ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0) 28.41 29.75 +4.72%
ADD(type=f32,ne=[4096,256,1,1],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0) 759.65 756.38 -0.43%
ADD(type=f32,ne=[4096,256,16,1],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0) 848.00 847.67 -0.04%
ADD(type=f32,ne=[4096,256,16,4],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0) 854.89 854.88 -0.00%
ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0) 1626.74 1629.70 +0.18%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CUDA Related to the CUDA backend ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: CUDA ggml_add() fails for large tensors

3 participants