ggml : address integer overflows in binary ops CUDA implementation by fairydreaming · Pull Request #24706 · ggml-org/llama.cpp

fairydreaming · 2026-06-16T20:13:59Z

Overview

This PR addresses the problem of integer overflows when processing large tensors with GGML binary OPs in CUDA backend. It's not a major overhaul, but rather a least-intrusive way of improving the current implementation.

Fixes #24643

Additional information

The problem was addressed by:

changing int type usage to uint32_t to double usable integer range,
adding casts so that multiplication result of two uint32_t won't overflow,
using asserts to verify before kernel launches that passed uint32_t parameters (and their products inside the kernels) will stay inside uint32_t type range,
making sure that i0 variable doesn't overflow when incremented in loop iterations,

I ran ./bin/test-backend-ops (all tests passed), ran failing tests from #24643 (passed) and compared performance of example small and large tensor f32 addition in old and new implementations (~~was the same~~ new impl after adding size_t casts is slightly slower, ~1.2% for small tensors, ~0.3% for large ones).

Additionally I compared kernel register usage in old and new implementation. ~~In most kernels (126) register usage was the same, in 17 kernels it was lower by 1-2 registers and in 10 kernels it was higher by 1-2 registers~~. With added size_t casts: unchanged in 74 kernels, lower by 1-2 registers in 15 kernels, higher by 1-4 registers in 64 kernels.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, discussed with AI possible ways of improving the implementation

ORippler

Could you share the measured performance numbers?

ORippler · 2026-06-17T08:39:45Z

+    const uint32_t i10 = fastmodulo(i0, ne10);

    ggml_cuda_pdl_sync();
    float result = src0_row ? (float) src0_row[i0*s00] : 0.0f;


Why do we not have to guard from overflow for i0*s00?

@ORippler right, this (and other places where two multiplied uint32_t are used as array index) needs a cast too to avoid overflow. Will see how it affects performance.

fairydreaming · 2026-06-17T09:19:15Z

@ORippler performance:

Without PR:

$ ./bin/test-backend-ops perf -o "ADD"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0):              966420 runs -     1.04 us/run -       48 kB/run -   44.10 GB/s
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0):            113378 runs -     8.86 us/run -    24576 kB/run - 2645.75 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

With PR:

$ ./bin/test-backend-ops perf -o "ADD"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0):              892710 runs -     1.13 us/run -       48 kB/run -   40.62 GB/s
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0):            113378 runs -     8.84 us/run -    24576 kB/run - 2650.32 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Looks like I missed 10% perf degradation for very small tensors caused by size_t() casts, investigating.

fairydreaming · 2026-06-17T10:12:38Z

@ORippler Added missing size_t casts, now performance is even higher than original:

$ ./bin/test-backend-ops perf -o "ADD"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0):             1007370 runs -     0.99 us/run -       48 kB/run -   46.03 GB/s
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0):            113378 runs -     8.86 us/run -    24576 kB/run - 2645.82 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Register usage: unchanged in 74 kernels, lower by 1-2 registers in 15 kernels, higher by 1-4 registers in 64 kernels.

Edit: I did a clean recompile of the project and now perf is again a little lower (43.21 GB/s for small matrix, so ~1.2% slower, 2631.88 GB/s for large, so 0.3% slower). Honestly I have no idea what causes these fluctuations.

ORippler · 2026-06-17T12:56:31Z

Edit: I did a clean recompile of the project and now perf is again a little lower (43.21 GB/s for small matrix, so ~1.2% slower, 2631.88 GB/s for large, so 0.3% slower). Honestly I have no idea what causes these fluctuations.

The perf benchmarks do not factor out:

Cache hits (cache is not invalidated/flushed)
Clocking differences
cudaGraph reuse

which may affect performance and introduce r2r variation. At 1 and 8 us/run this is pretty fast already (typically one cannot go faster than 2-5 us as at these are the minimums pro/epilogue overheads of launching a cuda kernel (constructing & tearing down CTAs etc.)

fairydreaming · 2026-06-17T14:07:17Z

@ORippler Yeah, I checked clock values of my Max-Q and it was doing pretty wild swings. Locked it to 1410 MHz, increased test time from 1s to 10s and now got this (also added some more tests):

Test	master GB/s	this GB/s	change
`ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0)`	28.27	29.55	+4.53%
`ADD(type=f32,ne=[4096,256,1,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0)`	1651.68	1638.44	-0.80%
`ADD(type=f32,ne=[4096,256,16,1],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0)`	1327.07	1322.73	-0.33%
`ADD(type=f32,ne=[4096,256,16,4],nr=[1,1,1,1],nf=1,perm1=0,src_overlap=0)`	1238.83	1236.64	-0.18%
`ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0)`	28.41	29.75	+4.72%
`ADD(type=f32,ne=[4096,256,1,1],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0)`	759.65	756.38	-0.43%
`ADD(type=f32,ne=[4096,256,16,1],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0)`	848.00	847.67	-0.04%
`ADD(type=f32,ne=[4096,256,16,4],nr=[1,1,1,1],nf=1,perm1=1,src_overlap=0)`	854.89	854.88	-0.00%
`ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1],nf=1,perm1=0,src_overlap=0)`	1626.74	1629.70	+0.18%

ggml : address integer overflows in binary ops CUDA implementation

d9b66bc

fairydreaming requested a review from a team as a code owner June 16, 2026 20:14

github-actions Bot added ggml changes relating to the ggml tensor library for machine learning CUDA Related to the CUDA backend labels Jun 16, 2026

ORippler reviewed Jun 17, 2026

View reviewed changes

ggml : add size_t casts to avoid integer overflows

bf95565

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml : address integer overflows in binary ops CUDA implementation#24706

ggml : address integer overflows in binary ops CUDA implementation#24706
fairydreaming wants to merge 2 commits into
ggml-org:masterfrom
fairydreaming:binary-ops-overflows

fairydreaming commented Jun 16, 2026 •

edited

Loading

Uh oh!

ORippler left a comment

Uh oh!

ORippler Jun 17, 2026

Uh oh!

fairydreaming Jun 17, 2026

Uh oh!

fairydreaming commented Jun 17, 2026

Uh oh!

fairydreaming commented Jun 17, 2026 •

edited

Loading

Uh oh!

ORippler commented Jun 17, 2026

Uh oh!

fairydreaming commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fairydreaming commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Requirements

Uh oh!

ORippler left a comment

Choose a reason for hiding this comment

Uh oh!

ORippler Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

fairydreaming Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

fairydreaming commented Jun 17, 2026

Uh oh!

fairydreaming commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ORippler commented Jun 17, 2026

Uh oh!

fairydreaming commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fairydreaming commented Jun 16, 2026 •

edited

Loading

fairydreaming commented Jun 17, 2026 •

edited

Loading