Commit 6c855a9

feat(graph): T99.1.2 mark Gemma4PLECombinedProducer non-capturable
Gemma4PLECombinedProducer performs a CPU-side gather over the shared PLE embedding table and then calls MulScalar on the freshly-allocated CPUStorage tensor. Inside a CUDA graph capture stream this triggers a synchronous H2D cudaMemcpy that CUDA rejects with "operation would make the legacy stream depend on a capturing blocking stream".

Add the op to nonCapturableOps so the producer runs in pre-capture on every forward, outside the capturing stream. The producer runs once per forward pass before the transformer loop, so this placement keeps the layer-body capture region intact.

Companion change in zerfoo/inference/gemma4_edge_ple_nodes.go (E99 T99.1.2) pre-slices the producer's outputs into stable GPU buffers so pleSliceNode stays fully capturable. Decision recorded in zerfoo/docs/adr/088-gemma4-ple-cuda-graph-capture.md.
1 parent fd646fb commit 6c855a9

1 file changed

Lines changed: 15 additions & 7 deletions

graph/cuda_graph.go

@@ -44,14 +44,22 @@ var debugGraphCapture = os.Getenv("ZERFOO_DEBUG_GPU") == "1"
 // (offset_memcpy kernel) and GQA uses GPU RoPE selection (rope_select kernel),
 // all position-dependent state is read from GPU memory at replay time, making
 // GQA fully capturable.
+//
+// Gemma4PLECombinedProducer: performs a CPU-side gather over the shared PLE
+// embedding table (token ids -> per-layer rows), then calls MulScalar on the
+// freshly-allocated CPUStorage tensor. Running this inside a capture stream
+// triggers a synchronous H2D cudaMemcpy that CUDA rejects. The producer runs
+// once per forward pass before the transformer loop, so placing it in
+// pre-capture keeps the layer-body capture region intact. See ADR-088.
 var nonCapturableOps = map[string]bool{
-	"EmbeddingLookup":   true,
-	"Gather":            true,
-	"AutoAttentionMask": true,
-	"AutoPositionIds":   true,
-	"Slice":             true,
-	"ConstantOfShape":   true,
-	"Shape":             true,
+	"EmbeddingLookup":           true,
+	"Gather":                    true,
+	"AutoAttentionMask":         true,
+	"AutoPositionIds":           true,
+	"Slice":                     true,
+	"ConstantOfShape":           true,
+	"Shape":                     true,
+	"Gemma4PLECombinedProducer": true,
 }
 
 // isNonCapturable returns true if the instruction at index i in the plan
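
To illustrate how a name-keyed table like nonCapturableOps can decide where capture begins, here is a minimal Go sketch. It is not the code in graph/cuda_graph.go: the instruction type, its OpName field, the captureStart helper, and the op names RMSNorm, GQA and MatMul in the sample plan are all hypothetical, and it assumes the simplest policy, where capture starts right after the last non-capturable op in the plan.

```go
package main

import "fmt"

// nonCapturableOps copies the map from this commit.
var nonCapturableOps = map[string]bool{
	"EmbeddingLookup":           true,
	"Gather":                    true,
	"AutoAttentionMask":         true,
	"AutoPositionIds":           true,
	"Slice":                     true,
	"ConstantOfShape":           true,
	"Shape":                     true,
	"Gemma4PLECombinedProducer": true,
}

// instruction is a hypothetical stand-in for one entry of a compiled plan.
type instruction struct {
	OpName string
}

// captureStart returns the index of the first instruction that follows the
// last non-capturable op. Instructions before that index would run eagerly
// on every forward pass; the rest would form the captured region.
// This is an assumed policy for illustration only.
func captureStart(plan []instruction) int {
	start := 0
	for i, inst := range plan {
		if nonCapturableOps[inst.OpName] {
			start = i + 1
		}
	}
	return start
}

func main() {
	// Hypothetical plan: the producer sits before the transformer layers.
	plan := []instruction{
		{OpName: "EmbeddingLookup"},
		{OpName: "Gemma4PLECombinedProducer"},
		{OpName: "RMSNorm"},
		{OpName: "GQA"},
		{OpName: "MatMul"},
	}

	start := captureStart(plan)
	for i, inst := range plan {
		region := "capture"
		if i < start {
			region = "pre-capture"
		}
		fmt.Printf("%d %-26s %s\n", i, inst.OpName, region)
	}
}
```

Under this assumed policy, adding Gemma4PLECombinedProducer to the map moves the producer into the eager prefix without shrinking the region that covers the layer body, which matches the intent stated in the commit message.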