Gpu offloading bugfixes by ThrudPrimrose · Pull Request #2384 · spcl/dace

ThrudPrimrose · 2026-05-31T10:50:49Z

GPU Offloading Bug Fixes

Branch gpu-offloading-bugfixes off main.

Bugs

#	File	Symptom	Fix
1	`dace/runtime/include/dace/math.h`	`pow(int64, int64)` returns `double` → invalid as array index / launch-dim	SFINAE: integer overload for only when exponent is unsigned; everything else stays the same.
2	`dace/libraries/standard/nodes/reduce.py`	`ExpandReduceGPUAuto` leaves a dangling View in the nested SDFG when input is a View	Wrap into a `NestedSDFG` whose external connector is the source Array; rebuild `Array → View → Reduce` inside with `implementation='CUDA (device)'` (CUB).
3	`dace/transformation/interstate/gpu_transform_sdfg.py`	Host iedge `if I[j]:` reads GPU-resident container → validation fails	Step 4/6 register promoted transients in `data_already_on_gpu`; Step 8 recurses into `LoopRegion`/`ConditionalBlock` and uses `used_arrays(sdfg.arrays)` instead of `free_symbols`. Result: GPU map stays on GPU, Step 8 inserts a `host_<name>` mirror + copy-out at the iedge boundary.

Reproducers (`tests/transformations/gpu_offloading_bugs_test.py`)

I added a unit test for each reproducer. These reproducers try to mimic the patterns which resulting in GPU offloading fail on NPBench. With this PR the NPBench offloading should work.

The previous SFINAE template covered every integer combination including signed exponents. That carried the same hidden assumption the original narrow ``pow(int, int)`` overload did: that ``b >= 0``. A signed exponent can be negative, in which case the mathematically correct value ``a^b`` is fractional (``1 / a^|b|``) and an integer return is wrong -- the old code papered over this by returning ``0`` for negative ``b``. Tighten the constraint: the integer overload is enabled only when the exponent type is *unsigned*. For every other case (any floating arg, any signed-integer exponent, or any complex arg) ``pow`` defers to ``std::pow`` and returns the natural floating / complex type. Existing complex behaviour is preserved -- complex types are not ``std::is_integral``, so the integer-only overload is disabled for them and the generic overload routes to ``std::pow`` exactly as before. Call sites that need an integer index must therefore declare the exponent symbol with an unsigned type (e.g. ``dace.uint32``); a signed ``int64_t`` exponent intentionally falls through to ``double``.

When the outer pattern is Array -> View -> Reduce, don't pull the View descriptor across the SDFG boundary (where it loses its viewed-source) and don't fall back to ExpandReducePure (which loses the fast library implementation). Instead, wrap the chain into a NestedSDFG whose external input connector is the *source* Array, and rebuild Array -> View -> Reduce *inside* with the inner Reduce set to use the CUB ``CUDA (device)`` expansion. The View structure is preserved (its viewed-source array is right there, ``get_view_edge`` resolves cleanly, validation is happy), and the actual reduction is delegated to the same fast library expansion that would have been used if the outer pattern were Array -> Reduce directly. Where CUB itself doesn't support a particular shape (e.g. multi-axis reduction without contiguous reduction dim) it has its own internal fallback to Pure -- that's its concern, and Pure handles Views fine. Helper ``_expand_view_input_via_nested`` builds the wrapper: - new descriptors: ``_in_src`` (source Array, input connector), ``_in_view`` (View, transient), ``_out`` (output, transient) - inner state: ``_in_src`` -> ``_in_view`` (via the outer view edge, retargeted to the inner names) -> inner Reduce -> ``_out`` - inner Reduce: same wcr/axes/identity, implementation='CUDA (device)' - outer state: input edge is rewired from ``view_node -> Reduce:_in`` to ``source_node -> Reduce:_in_src`` with the View's outer subset.

The right design for arrays read by a host interstate edge is to run the map on GPU and copy out at the iedge boundary -- not pin the array (and recursively, every connected one) to host. Step 8 ('Make data ready for interstate edges that use them') already exists and exactly does this for ''data_already_on_gpu'' / cloned input & output arrays: it creates a ''host_<name>'' mirror, inserts an ''<block>_icopyout'' state copying the GPU array to the host mirror, and rewrites the iedge reference from the device name to the host name. Two precise gaps were leaving transients out of this machinery: 1. Step 4 promotes the outputs of GPU-scheduled Map / LibraryNode / NestedSDFG to ''GPU_Global'' in place; it never recorded them in ''data_already_on_gpu''. Step 8's ''cloned_data'' therefore missed them and the host iedge read went unmirrored. Fix: a tiny ''_register_gpu_transient'' helper records every transient promoted in Step 4 in ''data_already_on_gpu''. Same for the analogous in-place promotion in Step 6. 2. Step 8 was iterating ''sdfg.all_control_flow_blocks()'' with the default ''recursive=False'' -- so iedges nested inside a LoopRegion (mandelbrot2's ''while ...: if I[j]: ...'') were never seen. Same recursion fix for the second loop. The detection now uses ''e.data.used_arrays(sdfg.arrays)'' instead of ''e.data.free_symbols'' -- the former filters to actual array names (skipping loop-iterator / scalar symbols that ''free_symbols'' would include). Adds four Python-frontend reproducers in tests/transformations/gpu_offloading_bugs_test.py covering each fixed bug pattern (Reduce-on-View, host iedge read, host/GPU map straddle, int-pow-as-index).

Two extra checkpoints making explicit that: - The ''pow(int64, uint32)'' case (signed base, unsigned exponent) hits the integer overload and stays integer-typed -- the design point that any non-negative exponent type is what unlocks the integer return. - Complex Reduce-on-View validates end-to-end. Complex types are not ''std::is_integral'', so the SFINAE integer-pow overload is disabled for them and the generic overload routes to ''std::pow'' unchanged. ''ExpandReduceGPUAuto'' clones descriptors dtype-blindly, so the View-handling path is the same for ''complex128'' as it is for ''float64''. The check is structural (the SDFG must validate after GPU offloading); numerical equivalence is the reduction body's concern.

Previous CI run hit ``QOSMaxSubmitJobPerUserLimit`` on daint without scheduling. Trivial docstring tweak to nudge a fresh pipeline.

ThrudPrimrose added 5 commits May 31, 2026 12:43

ci: docstring rephrase to re-trigger CI

4236f23

Previous CI run hit ``QOSMaxSubmitJobPerUserLimit`` on daint without scheduling. Trivial docstring tweak to nudge a fresh pipeline.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Gpu offloading bugfixes#2384

Gpu offloading bugfixes#2384
ThrudPrimrose wants to merge 5 commits into
mainfrom
gpu-offloading-bugfixes

ThrudPrimrose commented May 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ThrudPrimrose commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

GPU Offloading Bug Fixes

Bugs

Reproducers (tests/transformations/gpu_offloading_bugs_test.py)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ThrudPrimrose commented May 31, 2026 •

edited

Loading

Reproducers (`tests/transformations/gpu_offloading_bugs_test.py`)