Explicit copy memset nodes #2380

Open
ThrudPrimrose wants to merge 73 commits into main from explicit-copy-memset-nodes
@ThrudPrimrose

Collaborator

No description provided.

ThrudPrimrose and others added 30 commits May 13, 2026 16:43
Introduce explicit copy / memset library nodes and a pass that lifts
every implicit AccessNode -> AccessNode (or scope-staging) edge into
a CopyLibraryNode, so the dataflow shape post-simplification is
self-describing and the legacy "copy edge gets lowered by ad-hoc
codegen" path can be deprecated.

  dace/libraries/standard/nodes/copy_node.py     (CopyLibraryNode + 8 expansions)
  dace/libraries/standard/nodes/memset_node.py   (MemsetLibraryNode + 3 expansions)
  dace/libraries/standard/helper.py              (shared expansion helpers)
  dace/libraries/standard/environments/cpu.py    (CPU environment used by ExpandMemcpyCPU)
  dace/sdfg/construction_utils.py                (small utilities used by copy_node)
  dace/transformation/passes/insert_explicit_copies.py
                                                 (the pass)

Pass wiring: appended `InsertExplicitCopies` to `SIMPLIFY_PASSES` so
``SDFG.simplify()`` lifts implicit copies as part of standard cleanup
(idempotent — once lifted, no further implicit edges remain to match).

Tests:
  tests/library/copy_node_test.py
  tests/library/memset_node_test.py
  tests/passes/insert_explicit_copies_test.py
…ryNode

Three small fixes that unblock the Copy/Memset libnodes' schedule
inference:

  * ``infer_out_connector_type`` / ``infer_connector_types``: wrap the
    ``e.data.subset and e.data.subset.num_elements() == 1`` expression
    in ``bool()``. A single-element ``Range`` returns False from
    ``__bool__`` and the bare ``and`` chain leaks the Range object
    instead of a bool, which later trips the ``scalar |= …`` operator
    with TypeError when the libnode's collapsed subset is empty.
  * ``_determine_schedule_from_storage``: when the node under inspection
    is a ``CopyLibraryNode`` / ``MemsetLibraryNode`` AND any neighbouring
    memlet imposes a ``GPU_Device`` constraint, return ``GPU_Device``
    directly. Without this, an H2D copy has both ``CPU_Multicore``
    (from the CPU source) and ``GPU_Device`` (from the GPU sink) in its
    constraint set and the existing ``len(constraints) > 1`` branch
    raises ``InvalidSDFGNodeError: Cannot determine default schedule
    for node copy_A_to_gpu_A``. The libnode is exactly the node class
    designed to bridge storages; routing it to GPU_Device when GPU is
    involved is the intended resolution.
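
The ``bool()`` wrap in the first fix relies on a general Python rule: ``a and b`` evaluates to ``a`` itself when ``a`` is falsy, not to ``False``. A minimal sketch of the failure mode follows, using a hypothetical ``FakeSubset`` class (a stand-in for illustration, not DaCe's ``Range``) whose ``__bool__`` returns False:

```python
class FakeSubset:
    """Hypothetical stand-in for a subset type whose truth value is False."""

    def __init__(self, count):
        self.count = count

    def __bool__(self):
        return False

    def num_elements(self):
        return self.count


subset = FakeSubset(1)

# Bare `and` short-circuits and returns the falsy left operand itself.
leaked = subset and subset.num_elements() == 1
# Wrapping the chain in bool() always yields a real boolean.
fixed = bool(subset and subset.num_elements() == 1)

scalar = False
try:
    scalar |= leaked  # bool | FakeSubset is not defined -> TypeError
    or_failed = False
except TypeError:
    or_failed = True

scalar_fixed = False
scalar_fixed |= fixed  # bool | bool works as intended
```

Here ``leaked`` is the ``FakeSubset`` instance rather than a boolean, which is why the in-place ``|=`` raises, while the wrapped form stays a plain ``bool``.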
…simplify

InsertExplicitCopies materialises implicit AccessNode->AccessNode edges
into CopyLibraryNodes — a shape-changing lowering step, not a
shape-preserving simplification. Including it in SIMPLIFY_PASSES broke
22 tests (numpy reshape/flatten/view, redundant_copy count assertions,
range_indirection / reinterpret validators) that rely on simplify
preserving the implicit-edge form.

Pass remains available as a standalone Pass for consumers that want it
(e.g. the GPU codegen lowering pipeline, which calls it explicitly via
InsertExplicitGPUGlobalMemoryCopies).
Three xfail-strict pins:
- AccessNode<->View edges must not be lifted (policy: views are aliases).
- Rank-changing reshape lift produces mismatched-rank memlets that trip
  codegen IndexError in cpp_offset_expr.
- Dtype-reinterpret View lift produces CopyLibraryNode with mismatched
  element types, failing sdfg.validate().
…atch

InsertExplicitCopies changes:
- Drop the AN<->View edge lift (was inserting a Copy + intermediate buffer
  for every AN<->View edge, including pure aliasing edges that don't need
  one).
- Add a round-trip collapse: AN_src -> View -> AN_dst becomes a single
  AN_src -> CopyLibraryNode -> AN_dst direct edge with the composed memlet
  (src side from the view-underlying subset, dst side from the access-side
  subset). The View AccessNode is removed when it has no other consumers.

CopyLibraryNode expansion:
- select_copy_implementation now routes rank-mismatched volume-equal copies
  (different rank after collapse_shape_and_strides) to CopyNDTemplate when
  both sides are same-storage C-packed contiguous. MappedTasklet reuses a
  single access expression for both endpoints, which produces a
  rank-mismatched memlet on the smaller side and crashes codegen.
- ExpandCopyNDTemplate flattens to a 1D pointer walk when the in/out
  collapsed shapes have different ranks (both sides packed contiguous, so
  the linearization is sound).

Tests:
- Update the 4 view-lift tests to assert the new policy (round-trip collapses
  to a single Copy; View is removed if it has no other consumers; repeat
  applies are no-ops).
- Add 10 new test_iec_* pins covering both patterns: AN<->View edges kept
  direct, AN->View->AN round-trip collapse, AN->View round-trip with another
  consumer keeping the View, AN<->AN copies with rank differing because of a
  constant-index dim, AN<->AN rank-mismatched volume-equal copies routed
  through CopyND, plus @dace.program reshape and reinterpret cases.
…onstruction_utils

Use SDFG.parent (O(1)) instead of the recursive _get_parent_state scan
to find the state containing a nested-SDFG node; verified equivalent.
copy_node.py imports the helper from dace.transformation.helpers
(top-level, no import cycle).  dace/sdfg/construction_utils.py removed
(it only held these two helpers on this branch).
…trides in memset, Range.num_elements over reduce
ThrudPrimrose and others added 30 commits May 20, 2026 15:13
…rings

Same-rank copy with mismatched per-dim shapes (e.g. (3,4) -> (4,3))
previously slipped through libnode validation and tripped DaCe's
generic ``out-of-bounds memlet`` post-expansion validate -- the
rejection was incidental. Add a per-dim shape check at the head of
ExpandMappedTasklet's same-rank branch with a specific error
message; the test now asserts on that explicit contract instead of
the incidental SDFG-level error.

Trim multi-line test docstrings (transpose-pattern, rank-mismatch
variants, no-common-stride1, shared-memory collective, single-
element-in-kernel) to one line each.
… array shapes

The transpose-pattern check compares ``in_shape_collapsed`` vs
``out_shape_collapsed``, which come from
``collapse_shape_and_strides(subset, strides)`` -- per-dim subset
sizes after singleton-collapse, NOT the underlying array shapes.

Add two pin tests:

  test_copy_same_subset_different_array_shapes -- 0:N slice between
  arrays of different total size is fine when the per-dim subset
  sizes match.

  test_copy_1d_slice_from_2d_source -- a row-slice ``[i, 0:N]`` of
  a 2D array copies into a 1D array; the leading singleton
  collapses to the same rank on both sides.

Both pass; transpose-pattern still rejected.
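
The comparison described above can be sketched in plain Python. This is a simplified illustration, not the real ``collapse_shape_and_strides``: the hypothetical helpers ``collapse_sizes`` and ``same_collapsed_shape`` look only at per-dim subset sizes and ignore strides.

```python
def collapse_sizes(sizes):
    """Drop unit-length dimensions from a list of per-dim subset sizes."""
    return [s for s in sizes if s != 1]


def same_collapsed_shape(src_sizes, dst_sizes):
    """True when both subsets have equal per-dim sizes after collapsing."""
    return collapse_sizes(src_sizes) == collapse_sizes(dst_sizes)


N = 16

# Row slice [i, 0:N] of a 2D array into a 1D array: (1, N) vs (N,).
row_slice_ok = same_collapsed_shape([1, N], [N])

# 0:N slices of arrays with different total sizes: only subset sizes matter.
same_subset_ok = same_collapsed_shape([N], [N])

# Transpose pattern (3, 4) -> (4, 3): equal volume, different per-dim sizes.
transpose_ok = same_collapsed_shape([3, 4], [4, 3])
```

Under these assumptions the first two cases compare equal and the transpose pattern does not, which mirrors the accepted and rejected cases listed above.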
….X form

Consistency over brevity: every test now uses the full
dace.dtypes.StorageType.{CPU_Heap,GPU_Global,GPU_Shared,Register}
inline instead of a mix of local cpu/gpu aliases and inline long
forms. Drops 26 local-alias lines and removes the inconsistency
across the file. Two over-long docstrings trimmed to keep yapf
happy.
Previously select_memset_implementation picked CUDA / CPU purely from
storage type. cudaMemsetAsync / memset zero ``num_elements * sizeof(T)``
consecutive bytes from the dst pointer, so a non-contiguous subset
(e.g. a middle 2-D slice of a row-major array) silently zeroes memory
outside the region. Auto now routes such cases to the 'pure'
(mapped-tasklet) expansion, which writes per element via the subset.

ExpandCUDA / ExpandCPU also reject non-contiguous subsets upfront with
a clear error so explicit forcing still raises.
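
To illustrate why a flat memset is unsafe on a non-contiguous subset, the sketch below models a row-major 8x8 array as a Python list. The names ``flat_memset`` and ``subset_memset`` are hypothetical helpers for this example only; they imitate a byte-range memset and a per-element write through the subset, respectively.

```python
ROWS, COLS = 8, 8


def flat_memset(buf, offset, count):
    """Zero `count` consecutive elements starting at `offset` (memset-like)."""
    for i in range(offset, offset + count):
        buf[i] = 0


def subset_memset(buf, row_range, col_range):
    """Zero only the elements addressed by the 2-D subset, one at a time."""
    for r in range(*row_range):
        for c in range(*col_range):
            buf[r * COLS + c] = 0


rows, cols = (2, 5), (2, 5)  # a 3x3 block in the middle of the array
count = (rows[1] - rows[0]) * (cols[1] - cols[0])

flat = [1] * (ROWS * COLS)
flat_memset(flat, rows[0] * COLS + cols[0], count)

exact = [1] * (ROWS * COLS)
subset_memset(exact, rows, cols)

# Indices zeroed by the flat walk that lie outside the requested block.
clobbered = [
    i for i in range(ROWS * COLS)
    if flat[i] == 0 and not (2 <= i // COLS < 5 and 2 <= i % COLS < 5)
]
```

Both variants zero nine elements, but the flat walk spills into the rest of row 2 and the start of row 3 and never reaches row 4, whereas the per-element variant touches exactly the requested block.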

Tests:
  test_memset_auto_routes_non_contiguous_to_pure_cpu  -- Auto succeeds
    via 'pure', zeroes only the 6x10 sub-block.
  test_memset_cpu_rejects_non_contiguous_subset  -- explicit CPU raises.
  test_memset_cuda_rejects_non_contiguous_subset  -- explicit CUDA raises.

Also: in copy_node_test.py, factor out _make_copy_skeleton + add
_make_legacy_copy_sdfg (canonical Memlet(data=dst, subset=dst_subset,
other_subset=src_subset) form). The previous _strip_libnodes mutation
was producing memlets without other_subset, which made the legacy
codegen look broken on patterns it actually handles correctly. Keep
the genuine libnode-advantage pin -- rank-mismatch 4D->2D Fortran
reshape -- delete the three false-positive tests.
- helper.py: drop collapsed_map_lengths, inlined as
  [s for s in subset.size() if s != 1] (2 callsites; Range.size()
  is the existing util that does the same job).
- helper.py: shorten CURRENT_STREAM_NAME comment.
- insert_explicit_copies.py: drop src_locations / dst_locations /
  skip_inside_device_scope properties + _storage_allowed helper.
  Zero callers in dace/ or tests/ pass any of them. Drop the
  Iterable / is_devicelevel_gpu imports that go with them.
- insert_explicit_copies.py: fix the class docstring -- the pass
  only handles AN -> AN (and AN -> View -> AN via the round-trip
  collapse). Map-staging patterns are not handled on this branch
  (they were on explicit-gpu-global-copies; D1 outermost-subset
  bug; removed pending a correct rewrite).
…side the map scope

AN -> MapEntry -> AN (stage-in) and AN -> MapExit -> AN (stage-out)
edges now lift to a CopyLibraryNode placed INSIDE the map scope,
wired directly to MapEntry's output connector / MapExit's input
connector (no intermediate AN inserted on the scope-side). Chains
of MapEntries / MapExits are followed via memlet_path; the body
of the map (tasklets, NestedSDFGs, nested maps) is irrelevant to
the lift. Views on the outer side stay in place.

The outer-side memlet (per-iteration subset on the outer array)
is preserved verbatim on the new MapEntry -> libnode (or
libnode -> MapExit) edge; the inner-array-side memlet is derived
via the existing _derive_matching_dst_subset against the inner
array's descriptor.

Tests in tests/passes/insert_explicit_copies_test.py:

  test_lift_stage_in_copy
  test_lift_stage_out_copy
  test_lift_stage_in_copy_through_view
  test_lift_stage_out_copy_through_view
  test_lift_stage_in_copy_chained_map_entries
  test_lift_stage_out_copy_chained_map_exits
  test_lift_stage_in_copy_with_nested_sdfg_consumer

Each asserts: exactly one libnode in the lifted state, libnode's
scope owner is the (innermost) MapEntry, libnode input/output
wired directly to MapEntry/MapExit (not via an inserted AN),
numerical match against NumPy, and zero `CopyND<` template
instantiations in generated code.

New helper _assert_no_copynd(sdfg) calls generate_code and
greps each CodeObject for `CopyND<`; pinned in all seven new
staging tests.

Endpoint resolution refactored to sdutils.find_input_arraynode /
find_output_arraynode instead of inline memlet_path[0/-1] +
isinstance checks.
- Drop the unused third argument from _derive_matching_dst_subset
  (docstring already said "unused; kept for symmetric signature");
  update both direct-copy callsites accordingly.
- Fold _is_stage_in_candidate / _is_stage_out_candidate /
  _insert_stage_in_libnode / _insert_stage_out_libnode into one
  _lift_staging_edge(..., stage_in: bool) -- the four methods
  differed only by which side of the edge was the inner AN and
  which was the MapEntry/MapExit.

-30 LoC net.
Introduce _compile_no_copynd(sdfg) wrapper: greps every CodeObject
emitted by sdfg.generate_code() for 'CopyND<' and asserts none,
then returns sdfg.compile(). Apply to all 17 compile sites in
copy_node_test.py (including the libnode-side compile of the
legacy comparison test).

Pins the contract: libnode expansions displace the runtime CopyND
fallback entirely. The only intentional CopyND user is
ExpandSharedMemoryCollective, whose test inspects tasklet bodies
directly without compiling, so no exemption is needed.
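
A minimal sketch of the grep step, assuming each generated code object exposes ``name`` and ``code`` attributes. The ``SimpleNamespace`` objects and the code strings below are hypothetical stand-ins for what ``sdfg.generate_code()`` returns, so the example runs without DaCe; the real wrapper additionally calls ``sdfg.compile()``.

```python
from types import SimpleNamespace


def assert_no_copynd(code_objects):
    """Fail if any generated code object mentions the CopyND template."""
    offenders = [obj.name for obj in code_objects if 'CopyND<' in obj.code]
    assert not offenders, f"CopyND fallback found in: {offenders}"


# Hypothetical stand-ins for generated code objects (illustrative strings).
clean = [SimpleNamespace(name='main', code='memcpy(dst, src, n);')]
dirty = clean + [
    SimpleNamespace(name='kernel', code='dace::CopyND<float, 1>(...);')
]

assert_no_copynd(clean)  # passes silently

try:
    assert_no_copynd(dirty)
    caught = False
except AssertionError:
    caught = True
```

Checking the generated text before compiling keeps the contract cheap to enforce at every compile site.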
…cks; factor staging-test scaffold

dace/transformation/passes/insert_explicit_copies.py
- Drop dead ``if src_subset is None or dst_subset is None`` guard
  (_resolve_subset_for always returns a Range).
- Inline _expr_lt (4 lines, one caller in _is_consecutive_reshape);
  the try/except + intent comment now live where they're used.

dace/libraries/standard/nodes/copy_node.py
- Inline _coarse_pick_for_storage_pair (single caller in
  select_copy_implementation).
- Inline _cuda2d_strides_are_supported (single caller in
  _refine_cuda_impl_for_subsets); the explanatory comment moves with
  the logic.

tests/passes/insert_explicit_copies_test.py
- Factor the structural assertion shared across the seven staging
  tests into _assert_lifted_libnode(state, side, expected_scope=...)
  -- replaces ~5 lines of "find libnode + check scope + check wire"
  per test with one call.
- Pull "list View AccessNodes" into _view_an_names helper.

tests/library/{copy_node,memset_node}_test.py +
tests/passes/insert_explicit_copies_test.py
- Replace explicit ``if __name__ == "__main__": test_a(); test_b(); ...``
  blocks with ``pytest.main([__file__])`` so the script form picks up
  tests automatically without a hand-maintained list.
…shape

Three lines collapsed to one. The comment still explains why the
try/except swallows the exception silently (symbolic indeterminacy +
equal-product safety net) without the docstring-shaped prose that
was inherited from the deleted _expr_lt helper.
…ion pass

Functions that already receive parent_state no longer take a redundant
parent_sdfg/sdfg argument; they read the owning SDFG from state.sdfg.
This covers the copy/memset expansion helpers, the auto_dispatch shim,
CopyLibraryNode.src_storage/dst_storage, and the InsertExplicitCopies
private methods. The framework-fixed validate(self, sdfg, state) and the
ExpandTransformation.expansion(node, parent_state, parent_sdfg) signatures
are left intact.

Add copy_node tests for a padded (1, N) array whose unit leading dim
carries a non-packed stride: is_contiguous_subset is False, so Auto falls
back to a map (same storage) or a pitched cudaMemcpy2D (cross storage),
and the copy stays numerically exact.
ExpandMemcpyCPU and the CUDA1D helper duplicated the same validate ->
contiguous-check -> size -> pointer-Tasklet body, differing only in the
cross-storage flag and the memcpy vs cudaMemcpyAsync code string. Fold them
into one _make_memcpy_tasklet(node, parent_state, *, cuda), mirroring
MemsetLibraryNode's _make_memset_tasklet. Generated code is unchanged.
Imports: drop the copy/subsets/data aliases and inline imports for plain
top-level imports; add type hints to the module helpers.

Reduce _derive_matching_dst_subset to a single volume check via
subsets.Range.num_elements + dace.symbolic.equal, removing the four-branch
shape ladder, the hand-rolled symbolic equality, and _is_consecutive_reshape
(whose two-pointer walk only ever compared total volumes).

Stop collapsing AN -> View -> AN round-trips. A View is an Array subclass
with its own shape/strides, so _replace_direct_copies now lifts any
View<->Array movement edge and skips only the view's alias edge, leaving the
view in place as a copy endpoint (AN -> View -> Copy -> AN). Deletes
_collapse_round_trip_views and the now-unused _resolve_subset_for; updates the
view tests with structural + numerical checks for both view directions.
…e, test dedup

- _replace_direct_copies now resolves src/dst subsets via Memlet.get_src_subset /
  get_dst_subset (the memlet path) instead of an ambiguous data-name else branch;
  the self-copy convention stays explicit.
- ExpandSharedMemoryCollective's GPU_ThreadBlock guard uses the existing
  dace.sdfg.scope.is_in_scope, so get_parent_map_and_loop_scopes is dropped from
  transformation/helpers.py (the PR no longer modifies that file).
- Tests route SDFG construction through shared builders: view round-trips
  (_make_view_round_trip_sdfg, src + dst), chained-map staging
  (_build_chained_stage_sdfg), reshape (_run_reshape_copy_test), and the
  array_to_array cases (one parametrized test) via _build_copy_sdfg.
- Polybench tests reuse the canonical tests/polybench kernels instead of inline
  copies (covariance/correlation imported, fdtd-2d loaded by path under a clean
  module name); the __main__-only 'import polybench' in those three files moves
  under __main__ so they import without the absl CLI dependency.
- Shortened/cleaned comments and docstrings.
Three related dispatcher fixes in ``select_copy_implementation`` plus a
relaxation in ``ExpandTasklet`` so each case has a working expansion.

1. Shared <-> Register routes to ``Tasklet`` (single-element) or
   ``MappedTasklet`` (multi-element), never ``SharedMemoryCollective``
   (whose expansion explicitly rejects a Register endpoint and was
   crashing ``samples/optimization/matmul.py --version optimize_gpu``).

2. Shared <-> {Shared, Global} placed *inside* a ``GPU_ThreadBlock``
   map routes per-thread, same as Shared <-> Register. The collective
   expansion is itself the block-level op and rejects this placement.

3. Invariant: no single-element copy is ever routed to ``MappedTasklet``
   (a 0-D map crashes in memlet propagation). Steps 1 and 2 of the
   dispatcher handle the single-element case explicitly; Step 3+ only
   see multi-element inputs.

``ExpandTasklet`` is relaxed to accept any single-element copy regardless
of storage pair -- the volume check at the top of the expansion already
rejects multi-element, and the dispatcher now intentionally routes
thread-level Shared single-element copies here.
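
The three rules can be sketched as a small routing function. This is an illustration under stated assumptions, not the actual ``select_copy_implementation``: the function name ``route_shared_copy``, the string storage names, and the choice of ``SharedMemoryCollective`` for Shared <-> {Shared, Global} copies outside a thread-block map are assumptions made for the example, and any case the rules do not describe returns ``None``.

```python
def route_shared_copy(src, dst, num_elements, inside_thread_block):
    """Return an expansion name for the Shared-memory cases, else None."""
    pair = {src, dst}
    if 'Shared' not in pair:
        return None  # not covered by the rules sketched here

    # Rule 3 (invariant): a single element never goes to MappedTasklet.
    per_thread = 'Tasklet' if num_elements == 1 else 'MappedTasklet'

    if 'Register' in pair:
        return per_thread  # rule 1: Shared <-> Register
    if pair <= {'Shared', 'Global'}:
        if inside_thread_block:
            return per_thread  # rule 2: per-thread inside GPU_ThreadBlock
        return 'SharedMemoryCollective'  # assumed block-level default
    return None


storages = ['Shared', 'Register', 'Global']
single_element_routes = [
    route_shared_copy(s, d, 1, inside)
    for s in storages for d in storages for inside in (False, True)
]
```

Enumerating every storage pair at single-element volume, as the list comprehension does, is the same shape of check as the parameterised invariant test described below.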

Tests in ``tests/library/copy_node_test.py``:

- 7 exact-impl unit tests, one per unique routing rule
- 1 parameterised invariant test enumerating every (src, dst) storage
  pair at single-element volume, asserting != ``MappedTasklet``
- 1 ``__syncthreads()`` emission check at single-element volume
- 3 correctness roundtrip kernels: variant A (cooperative load OUTSIDE
  ``tblock_map``), variant B (per-thread load INSIDE ``tblock_map``),
  and a full Global -> Shared -> Register -> Register -> Shared -> Global
  pipeline

Regressions verified across both ``DACE_compiler_cuda_implementation``
values (legacy + experimental): 82/82 ``copy_node_test`` pass; full
sweep of ``tests/library``, ``tests/transformations``,
``tests/codegen``, ``tests/sdfg``, ``tests/npbench`` shows 0 code
regressions.
…esolved subset

The previous lift cloned the inner edge's Memlet onto the new outer-side
edge (``MapEntry -> libnode`` for stage-in, ``libnode -> MapExit`` for
stage-out). When the inner edge's Memlet was named dst-relative
(``data == inner_node.data`` with the outer-side subset in
``other_subset``), the cloned Memlet carried the wrong ``data`` field and
SDFG validation rejected the lifted edge with
``Memlet data does not match source or destination data nodes``.

Construct the outer-side Memlet explicitly: query
``get_src_subset`` / ``get_dst_subset`` to resolve the subset in the
outer array's index space, then build ``Memlet(data=outer.data,
subset=...)`` -- propagating ``dynamic`` and ``wcr`` from the original.
The inner-side subset is derived from the resolved outer subset rather
than from ``outer_memlet.subset`` (which may be inner-relative).

Test: ``tests/codegen/argument_signature_test.py::test_argument_signature_compiles_and_runs``
constructs exactly this dst-relative pattern (``A -> MapEntry/OUT_A ->
tmp_in`` with ``Memlet(data='tmp_in', subset=0, other_subset=__i1)``)
and now passes under both legacy and experimental codegens.
``tests/passes/insert_explicit_copies_test.py``: 51/51 (experimental),
53/53 (legacy) -- no regressions.
The outgoing Memlet at a scope exit can be source-relative -- naming the
inner transient rather than the external array being written -- so using its
.data dropped the real destination array (and its stride symbols) from a GPU
kernel's argument list, yielding 'identifier undefined' at compile. Resolve
the written array from the memlet tree's root (the outermost-scope node, the
actual fan-out destination) instead, matching the long-standing NOTE here.
The previous form bundled the ``arglist`` API check with an optional
``sdfg.compile()`` gated on a cupy import. Split into:

- ``test_argument_signature_test``: CPU-only ``arglist`` assertion that
  the indirect ``A`` / ``D`` references are reported as kernel arguments.
- ``test_argument_signature_compiles_and_runs``: ``@pytest.mark.gpu``
  end-to-end build + run that exercises the kernel-signature path the
  ``arglist`` fix at ``075052309`` repaired (the bug surfaced as
  ``identifier "D" is undefined`` at ``nvcc``).

The shared SDFG-construction code is factored into
``_make_indirect_reference_sdfg`` so both tests use the same fixture.
No functional change. Trim the dst-relative-Memlet note in
``_lift_staging_edge`` to single-paragraph form, matching the
comment density in ``copy_node.py``.
That fixed some obscure cases:
- It was not possible to change the backend, which only affects systems that have both an Nvidia _and_ an AMD GPU.
- `lru_cache` is not needed; `cache` is fully adequate.
If only a single element is copied, the copy is always contiguous.

---------

Co-authored-by: Philip Muller <phimuell@santis>
… connector

Cherry-pick of 30f5e0b minus the experimental_cuda KernelSpec piece
(file doesn't exist on this branch).

1) ``_refine_cuda_impl_for_subsets`` / ``_make_mapped_tasklet_expansion``:
   replace sympy ``==`` shape comparisons with ``dace.symbolic.inequal_symbols``,
   which first runs ``equalize_symbols`` to merge same-named SymPy symbols
   carrying different assumption sets (e.g. ``Symbol('N', integer=True)`` vs
   ``Symbol('N', integer=True, positive=True)``). Without this the per-dim
   shape check rejected identical shapes as 'permutations', the 2D
   cuMemcpy2D selector misrouted, and the strided-stride / shape check
   spuriously failed.

2) ``_make_expansion_sdfg``: when the libnode carries the
   ``__dace_current_stream`` in-connector (gpuStream_t), register a matching
   scalar in the wrapper SDFG so the resulting NestedSDFG passes validation.
   Legacy codegen never adds the connector, so the branch is a no-op there.
Cherry-pick of the ``tests/library/copy_node_test.py`` portion of
714dec0 (the PromoteGPUScalarsToArrays parts of that commit don't
apply -- file doesn't exist on this branch).

``test_copy_cuda_1d_single_element`` passed raw strings as strides
(``strides=["src_stride"]``) which ``Array.validate`` rejects ("Strides
must be ... integer values or symbols"). The scaffold now sympifies
stride entries -- which causes ``add_datadesc`` to auto-register the
stride symbol -- so the test's own ``sdfg.add_symbol("src_stride",
dace.int32)`` collided and is removed.