Commit 6c855a9
feat(graph): T99.1.2 mark Gemma4PLECombinedProducer non-capturable
Gemma4PLECombinedProducer performs a CPU-side gather over the shared
PLE embedding table and then calls MulScalar on the freshly-allocated
CPUStorage tensor. Inside a CUDA graph capture stream this triggers a
synchronous H2D cudaMemcpy that CUDA rejects with "operation would make
the legacy stream depend on a capturing blocking stream".
Add the op to nonCapturableOps so the producer runs in pre-capture on
every forward, outside the capturing stream. The producer runs once
per forward pass before the transformer loop, so this placement keeps
the layer-body capture region intact.
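
The placement described above can be sketched in Go. This is a minimal illustration, not the repository's code: only the names `nonCapturableOps` and `Gemma4PLECombinedProducer` come from the commit message. The `node` type, the map layout, the `splitForCapture` helper, and the policy of running everything up to the last non-capturable op before capture are assumptions.

```go
package main

import "fmt"

// nonCapturableOps mirrors the set named in the commit message.
// Representing it as a map keyed by op type is an assumption.
var nonCapturableOps = map[string]bool{
	"Gemma4PLECombinedProducer": true,
}

// node is a hypothetical stand-in for a graph node; only the op type matters here.
type node struct {
	Name   string
	OpType string
}

// splitForCapture walks the nodes in execution order and splits them at the
// last non-capturable op. Everything up to and including that op runs eagerly
// before capture begins; the remainder is recorded into the CUDA graph.
// This prefix policy is an assumption chosen to keep dependencies in order.
func splitForCapture(nodes []node) (preCapture, captured []node) {
	last := -1
	for i, n := range nodes {
		if nonCapturableOps[n.OpType] {
			last = i
		}
	}
	return nodes[:last+1], nodes[last+1:]
}

func main() {
	nodes := []node{
		{Name: "embed", OpType: "Embedding"},
		{Name: "ple", OpType: "Gemma4PLECombinedProducer"},
		{Name: "layer0", OpType: "TransformerLayer"},
		{Name: "layer1", OpType: "TransformerLayer"},
	}
	pre, rest := splitForCapture(nodes)
	fmt.Println("pre-capture:", len(pre), "captured:", len(rest))
}
```

Because the producer sits ahead of the transformer loop, a split of this kind leaves every layer node inside the captured region.
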
Companion change in zerfoo/inference/gemma4_edge_ple_nodes.go
(E99 T99.1.2) pre-slices the producer's outputs into stable GPU
buffers so pleSliceNode stays fully capturable.
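
The pre-slicing idea can be illustrated with a short Go sketch. Everything here is hypothetical: the `stableSlices` type, its methods, and the row-per-layer layout are assumptions, and plain Go slices stand in for GPU buffers. The point it shows is that the per-layer buffers are allocated once and refreshed in place, so a captured graph can keep reading from the same addresses on every replay.

```go
package main

import "fmt"

// stableSlices holds one preallocated buffer per layer. The buffers are
// allocated once so their addresses stay fixed across forward passes.
type stableSlices struct {
	perLayer [][]float32
}

// newStableSlices allocates numLayers buffers of the given width.
func newStableSlices(numLayers, width int) *stableSlices {
	s := &stableSlices{perLayer: make([][]float32, numLayers)}
	for i := range s.perLayer {
		s.perLayer[i] = make([]float32, width)
	}
	return s
}

// update copies the producer's combined output into the existing buffers.
// The combined output is assumed to be numLayers contiguous rows of equal width.
func (s *stableSlices) update(combined []float32) error {
	if len(s.perLayer) == 0 {
		return nil
	}
	width := len(s.perLayer[0])
	if len(combined) != width*len(s.perLayer) {
		return fmt.Errorf("combined length %d, want %d", len(combined), width*len(s.perLayer))
	}
	for i, buf := range s.perLayer {
		copy(buf, combined[i*width:(i+1)*width])
	}
	return nil
}

func main() {
	s := newStableSlices(2, 3)
	before := &s.perLayer[1][0]
	if err := s.update([]float32{1, 2, 3, 4, 5, 6}); err != nil {
		panic(err)
	}
	// The contents change, but the backing storage does not move.
	fmt.Println(s.perLayer[1], before == &s.perLayer[1][0])
}
```
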
Decision recorded in zerfoo/docs/adr/088-gemma4-ple-cuda-graph-capture.md.
1 parent fd646fb commit 6c855a9
1 file changed
Lines changed: 15 additions & 7 deletions
@@ -44,14 +44,22 @@