Gpu offloading bugfixes#2384
Open
ThrudPrimrose wants to merge 5 commits into
Open
Conversation
The previous SFINAE template covered every integer combination including signed exponents. That carried the same hidden assumption the original narrow ``pow(int, int)`` overload did: that ``b >= 0``. A signed exponent can be negative, in which case the mathematically correct value ``a^b`` is fractional (``1 / a^|b|``) and an integer return is wrong -- the old code papered over this by returning ``0`` for negative ``b``. Tighten the constraint: the integer overload is enabled only when the exponent type is *unsigned*. For every other case (any floating arg, any signed-integer exponent, or any complex arg) ``pow`` defers to ``std::pow`` and returns the natural floating / complex type. Existing complex behaviour is preserved -- complex types are not ``std::is_integral``, so the integer-only overload is disabled for them and the generic overload routes to ``std::pow`` exactly as before. Call sites that need an integer index must therefore declare the exponent symbol with an unsigned type (e.g. ``dace.uint32``); a signed ``int64_t`` exponent intentionally falls through to ``double``.
When the outer pattern is Array -> View -> Reduce, don't pull the View descriptor across the SDFG boundary (where it loses its viewed-source) and don't fall back to ExpandReducePure (which loses the fast library implementation). Instead, wrap the chain into a NestedSDFG whose external input connector is the *source* Array, and rebuild Array -> View -> Reduce *inside* with the inner Reduce set to use the CUB ``CUDA (device)`` expansion. The View structure is preserved (its viewed-source array is right there, ``get_view_edge`` resolves cleanly, validation is happy), and the actual reduction is delegated to the same fast library expansion that would have been used if the outer pattern were Array -> Reduce directly. Where CUB itself doesn't support a particular shape (e.g. multi-axis reduction without contiguous reduction dim) it has its own internal fallback to Pure -- that's its concern, and Pure handles Views fine. Helper ``_expand_view_input_via_nested`` builds the wrapper: - new descriptors: ``_in_src`` (source Array, input connector), ``_in_view`` (View, transient), ``_out`` (output, transient) - inner state: ``_in_src`` -> ``_in_view`` (via the outer view edge, retargeted to the inner names) -> inner Reduce -> ``_out`` - inner Reduce: same wcr/axes/identity, implementation='CUDA (device)' - outer state: input edge is rewired from ``view_node -> Reduce:_in`` to ``source_node -> Reduce:_in_src`` with the View's outer subset.
The right design for arrays read by a host interstate edge is to run
the map on GPU and copy out at the iedge boundary -- not pin the array
(and recursively, every connected one) to host.
Step 8 ('Make data ready for interstate edges that use them') already
exists and exactly does this for ''data_already_on_gpu'' / cloned input
& output arrays: it creates a ''host_<name>'' mirror, inserts an
''<block>_icopyout'' state copying the GPU array to the host mirror,
and rewrites the iedge reference from the device name to the host name.
Two precise gaps were leaving transients out of this machinery:
1. Step 4 promotes the outputs of GPU-scheduled
Map / LibraryNode / NestedSDFG to ''GPU_Global'' in place; it never
recorded them in ''data_already_on_gpu''. Step 8's ''cloned_data''
therefore missed them and the host iedge read went unmirrored.
Fix: a tiny ''_register_gpu_transient'' helper records every
transient promoted in Step 4 in ''data_already_on_gpu''. Same for
the analogous in-place promotion in Step 6.
2. Step 8 was iterating ''sdfg.all_control_flow_blocks()'' with the
default ''recursive=False'' -- so iedges nested inside a LoopRegion
(mandelbrot2's ''while ...: if I[j]: ...'') were never seen.
Same recursion fix for the second loop. The detection now uses
''e.data.used_arrays(sdfg.arrays)'' instead of
''e.data.free_symbols'' -- the former filters to actual array names
(skipping loop-iterator / scalar symbols that ''free_symbols''
would include).
Adds four Python-frontend reproducers in
tests/transformations/gpu_offloading_bugs_test.py covering each fixed
bug pattern (Reduce-on-View, host iedge read, host/GPU map straddle,
int-pow-as-index).
Two extra checkpoints making explicit that: - The ''pow(int64, uint32)'' case (signed base, unsigned exponent) hits the integer overload and stays integer-typed -- the design point that any non-negative exponent type is what unlocks the integer return. - Complex Reduce-on-View validates end-to-end. Complex types are not ''std::is_integral'', so the SFINAE integer-pow overload is disabled for them and the generic overload routes to ''std::pow'' unchanged. ''ExpandReduceGPUAuto'' clones descriptors dtype-blindly, so the View-handling path is the same for ''complex128'' as it is for ''float64''. The check is structural (the SDFG must validate after GPU offloading); numerical equivalence is the reduction body's concern.
Previous CI run hit ``QOSMaxSubmitJobPerUserLimit`` on daint without scheduling. Trivial docstring tweak to nudge a fresh pipeline.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
GPU Offloading Bug Fixes
Branch
gpu-offloading-bugfixesoffmain.Bugs
dace/runtime/include/dace/math.hpow(int64, int64)returnsdouble→ invalid as array index / launch-dimdace/libraries/standard/nodes/reduce.pyExpandReduceGPUAutoleaves a dangling View in the nested SDFG when input is a ViewNestedSDFGwhose external connector is the source Array; rebuildArray → View → Reduceinside withimplementation='CUDA (device)'(CUB).dace/transformation/interstate/gpu_transform_sdfg.pyif I[j]:reads GPU-resident container → validation failsdata_already_on_gpu; Step 8 recurses intoLoopRegion/ConditionalBlockand usesused_arrays(sdfg.arrays)instead offree_symbols. Result: GPU map stays on GPU, Step 8 inserts ahost_<name>mirror + copy-out at the iedge boundary.Reproducers (
tests/transformations/gpu_offloading_bugs_test.py)I added a unit test for each reproducer. These reproducers try to mimic the patterns which resulting in GPU offloading fail on NPBench. With this PR the NPBench offloading should work.