Skip to content

Gpu offloading bugfixes#2384

Open
ThrudPrimrose wants to merge 5 commits into
mainfrom
gpu-offloading-bugfixes
Open

Gpu offloading bugfixes#2384
ThrudPrimrose wants to merge 5 commits into
mainfrom
gpu-offloading-bugfixes

Conversation

@ThrudPrimrose

@ThrudPrimrose ThrudPrimrose commented May 31, 2026

Copy link
Copy Markdown
Collaborator

GPU Offloading Bug Fixes

Branch gpu-offloading-bugfixes off main.

Bugs

# File Symptom Fix
1 dace/runtime/include/dace/math.h pow(int64, int64) returns double → invalid as array index / launch-dim SFINAE: integer overload for only when exponent is unsigned; everything else stays the same.
2 dace/libraries/standard/nodes/reduce.py ExpandReduceGPUAuto leaves a dangling View in the nested SDFG when input is a View Wrap into a NestedSDFG whose external connector is the source Array; rebuild Array → View → Reduce inside with implementation='CUDA (device)' (CUB).
3 dace/transformation/interstate/gpu_transform_sdfg.py Host iedge if I[j]: reads GPU-resident container → validation fails Step 4/6 register promoted transients in data_already_on_gpu; Step 8 recurses into LoopRegion/ConditionalBlock and uses used_arrays(sdfg.arrays) instead of free_symbols. Result: GPU map stays on GPU, Step 8 inserts a host_<name> mirror + copy-out at the iedge boundary.

Reproducers (tests/transformations/gpu_offloading_bugs_test.py)

I added a unit test for each reproducer. These reproducers try to mimic the patterns which resulting in GPU offloading fail on NPBench. With this PR the NPBench offloading should work.

The previous SFINAE template covered every integer combination including
signed exponents. That carried the same hidden assumption the original
narrow ``pow(int, int)`` overload did: that ``b >= 0``. A signed
exponent can be negative, in which case the mathematically correct value
``a^b`` is fractional (``1 / a^|b|``) and an integer return is wrong --
the old code papered over this by returning ``0`` for negative ``b``.

Tighten the constraint: the integer overload is enabled only when the
exponent type is *unsigned*. For every other case (any floating arg, any
signed-integer exponent, or any complex arg) ``pow`` defers to
``std::pow`` and returns the natural floating / complex type. Existing
complex behaviour is preserved -- complex types are not
``std::is_integral``, so the integer-only overload is disabled for
them and the generic overload routes to ``std::pow`` exactly as before.
Call sites that need an integer index must therefore declare the
exponent symbol with an unsigned type (e.g. ``dace.uint32``); a signed
``int64_t`` exponent intentionally falls through to ``double``.
When the outer pattern is Array -> View -> Reduce, don't pull the View
descriptor across the SDFG boundary (where it loses its viewed-source)
and don't fall back to ExpandReducePure (which loses the fast library
implementation). Instead, wrap the chain into a NestedSDFG whose
external input connector is the *source* Array, and rebuild
Array -> View -> Reduce *inside* with the inner Reduce set to use the
CUB ``CUDA (device)`` expansion.

The View structure is preserved (its viewed-source array is right there,
``get_view_edge`` resolves cleanly, validation is happy), and the
actual reduction is delegated to the same fast library expansion that
would have been used if the outer pattern were Array -> Reduce directly.
Where CUB itself doesn't support a particular shape (e.g. multi-axis
reduction without contiguous reduction dim) it has its own internal
fallback to Pure -- that's its concern, and Pure handles Views fine.

Helper ``_expand_view_input_via_nested`` builds the wrapper:
- new descriptors: ``_in_src`` (source Array, input connector),
  ``_in_view`` (View, transient), ``_out`` (output, transient)
- inner state: ``_in_src`` -> ``_in_view`` (via the outer view edge,
  retargeted to the inner names) -> inner Reduce -> ``_out``
- inner Reduce: same wcr/axes/identity, implementation='CUDA (device)'
- outer state: input edge is rewired from ``view_node -> Reduce:_in``
  to ``source_node -> Reduce:_in_src`` with the View's outer subset.
The right design for arrays read by a host interstate edge is to run
the map on GPU and copy out at the iedge boundary -- not pin the array
(and recursively, every connected one) to host.

Step 8 ('Make data ready for interstate edges that use them') already
exists and exactly does this for ''data_already_on_gpu'' / cloned input
& output arrays: it creates a ''host_<name>'' mirror, inserts an
''<block>_icopyout'' state copying the GPU array to the host mirror,
and rewrites the iedge reference from the device name to the host name.
Two precise gaps were leaving transients out of this machinery:

1. Step 4 promotes the outputs of GPU-scheduled
   Map / LibraryNode / NestedSDFG to ''GPU_Global'' in place; it never
   recorded them in ''data_already_on_gpu''. Step 8's ''cloned_data''
   therefore missed them and the host iedge read went unmirrored.
   Fix: a tiny ''_register_gpu_transient'' helper records every
   transient promoted in Step 4 in ''data_already_on_gpu''. Same for
   the analogous in-place promotion in Step 6.

2. Step 8 was iterating ''sdfg.all_control_flow_blocks()'' with the
   default ''recursive=False'' -- so iedges nested inside a LoopRegion
   (mandelbrot2's ''while ...: if I[j]: ...'') were never seen.
   Same recursion fix for the second loop. The detection now uses
   ''e.data.used_arrays(sdfg.arrays)'' instead of
   ''e.data.free_symbols'' -- the former filters to actual array names
   (skipping loop-iterator / scalar symbols that ''free_symbols''
   would include).

Adds four Python-frontend reproducers in
tests/transformations/gpu_offloading_bugs_test.py covering each fixed
bug pattern (Reduce-on-View, host iedge read, host/GPU map straddle,
int-pow-as-index).
Two extra checkpoints making explicit that:

- The ''pow(int64, uint32)'' case (signed base, unsigned exponent) hits
  the integer overload and stays integer-typed -- the design point that
  any non-negative exponent type is what unlocks the integer return.

- Complex Reduce-on-View validates end-to-end. Complex types are not
  ''std::is_integral'', so the SFINAE integer-pow overload is disabled
  for them and the generic overload routes to ''std::pow''
  unchanged. ''ExpandReduceGPUAuto'' clones descriptors dtype-blindly,
  so the View-handling path is the same for ''complex128'' as it is
  for ''float64''. The check is structural (the SDFG must validate
  after GPU offloading); numerical equivalence is the reduction
  body's concern.
Previous CI run hit ``QOSMaxSubmitJobPerUserLimit`` on daint without
scheduling. Trivial docstring tweak to nudge a fresh pipeline.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant