[WIP] Modernize perf analysis and memory volume by ThrudPrimrose · Pull Request #2387 · spcl/dace

ThrudPrimrose · 2026-06-02T14:08:43Z

Testing needs to be completed.

… memory-volume Combines PR #2356 (work-depth + operational-intensity adapted to ControlFlowRegions) with PR #2357 (symbolic memory-traffic analysis, total_volume.py). Updates the work-depth Transpose import to its current dace.libraries.linalg location (moved out of standard by #2317).

Move the 'import polybench' (and the absl-backed entry point) into each kernel's '__main__' block so the modules can be imported as plain kernel libraries for analysis/tests without requiring absl-py. Behavior when run as a script is unchanged.

…ybench Replace the TestAnalyzeSdfg class with module-level test_* functions, add the copyright header and Sphinx docstrings, trim unused imports, and remove the duplicated test_jacobi_1d (defined twice) and the inline jacobi kernels: per-kernel volume checks on real kernels move to polybench_analysis_test.

Remove a stray sdfg.save('sdfg.sdfg') debug write, fix main() unpacking a 4-tuple from analyze_sdfg's 2-tuple return, correct invalid type-hint syntax, replace bare 'except:' with 'except Exception', avoid mutable default arguments, de-alias imports, and add the copyright header, module docstring and Sphinx docstrings.

Remove the stray 'from unittest import result' in work_depth_test and replace a bare 'except:' with 'except Exception' in the operational-intensity interstate-edge handling.

Replace the incomplete hard-coded zero-work cast list with one derived from the available DaCe data types plus Python/C builtins (so casts like double, dace.uint16 no longer warn), and treat int_floor/int_ceil as one integer division each.

Add an optimize parameter (default True) so callers can request the raw data volume without auto-optimization (which can inject vectorization-tiling artifacts). Route summations through safe_summation, which strips the positive assumptions that make SymPy's posify raise on DaCe size symbols, so triangular-loop kernels (cholesky/lu/ludcmp) no longer crash.

Map scopes multiplied body work by the iteration count, which is only correct when the work is independent of the iteration variable; for triangular nests (an inner map bound by an outer parameter, e.g. syrk, symm, trmm, correlation, covariance) this left the outer index free in the result. Introduce a shared accumulate_over_range helper that summates work (and depth, for loops) over the iteration domain -- reducing to a multiplication when independent -- and use it from both the loop and map handlers so they accumulate identically.
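A minimal numeric sketch of why multiplication fails on a triangular nest. This is plain Python, not the DaCe helper; the function names and the `free_index` parameter are hypothetical and only illustrate the idea of summing over the iteration domain:

```python
def accumulate_by_multiplication(body_work, start, stop, free_index):
    """Old map-scope rule: body work at one index times the trip count."""
    return body_work(free_index) * (stop - start)


def accumulate_by_summation(body_work, start, stop):
    """Sum the body work over every point of the iteration domain."""
    return sum(body_work(i) for i in range(start, stop))


N = 6
triangular_work = lambda i: i + 1   # inner map runs j = 0..i, so work depends on i
uniform_work = lambda i: 4          # work independent of the outer index

summed = accumulate_by_summation(triangular_work, 0, N)
# The product changes with the outer index that is plugged in, i.e. i stays free.
products = {accumulate_by_multiplication(triangular_work, 0, N, i) for i in range(N)}
```

For `uniform_work` both rules agree, which is the "reduces to a multiplication when independent" case.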

A symbolic step (e.g. a tile size) cannot be compared with > 0; decide the iteration direction from step.is_negative instead, treating an unknown-sign step as forward (map steps are never negative).
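The decision rule can be sketched without SymPy by modelling the three-valued result of `is_negative` (True, False, or None when the sign is unknown). The function name `iterates_forward` is hypothetical:

```python
from typing import Optional


def iterates_forward(step_is_negative: Optional[bool]) -> bool:
    """Only a step proven negative makes the range descend.

    None stands for "sign unknown", as with a symbolic tile size.
    """
    return step_is_negative is not True


forward_for_unknown = iterates_forward(None)    # symbolic tile size
forward_for_positive = iterates_forward(False)  # e.g. step = 1
forward_for_negative = iterates_forward(True)   # e.g. step = -1
```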

A symbol read location may be any ControlFlowBlock (an SDFGState, or a loop or conditional region reading the symbol in its condition/bounds), not only an SDFGState or interstate edge. The scope-coarsening check assumed the non-state case was always an edge and dereferenced .src, raising on a LoopRegion. Reduce any control-flow block to itself and only edges to their source block before testing reachability.
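The reduction described above could look roughly like this. `Block` and `Edge` are hypothetical stand-ins for DaCe's control-flow blocks and interstate edges, and `reachability_anchor` is an illustrative name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Stand-in for any control-flow block (state, loop, conditional)."""
    name: str


@dataclass(frozen=True)
class Edge:
    """Stand-in for an interstate edge."""
    src: Block
    dst: Block


def reachability_anchor(location):
    """Edges reduce to their source block; any block reduces to itself."""
    return location.src if isinstance(location, Edge) else location


loop = Block("loop_region")
state = Block("state")
edge = Edge(src=state, dst=loop)
```

Testing the edge case explicitly, rather than assuming "not a state means edge", is what avoids dereferencing `.src` on a loop region.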

Interstate-edge assignments mix value computation (e.g. the bitwise updates to a CRC kernel's loop-carried scalars) with pure addressing (index/bound helpers). Add compute_symbols, which classifies a symbol as compute when it is read inside a tasklet or transitively feeds such a symbol, and count an edge assignment's arithmetic toward work only for compute symbols, so addressing arithmetic no longer inflates the work estimate.
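A rough sketch of the classification as a backward fixed point. The input shape (a dict from assigned symbol to the symbols its right-hand side reads) and the example symbol names are assumptions made for illustration, not the signature used in the PR:

```python
def compute_symbols(assignments, tasklet_reads):
    """Return the symbols that are read in a tasklet or transitively feed one.

    assignments: symbol -> set of symbols read by its right-hand side.
    tasklet_reads: symbols read directly inside tasklets.
    """
    compute = set(tasklet_reads)
    changed = True
    while changed:
        changed = False
        for target, sources in assignments.items():
            if target in compute:
                new = set(sources) - compute
                if new:
                    compute |= new
                    changed = True
    return compute


assignments = {
    "crc": {"crc", "poly"},    # loop-carried value computation
    "poly": {"seed"},          # feeds crc, so it is compute as well
    "idx": {"i", "stride"},    # pure addressing helper
}
result = compute_symbols(assignments, tasklet_reads={"crc"})
```

Only assignments whose target lands in `result` would contribute arithmetic to the work estimate.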

Add leading-order flop/depth counters for the linalg solver library nodes: Cholesky ~N^3/3, Inv ~2N^3, Solve ~2N^3/3 + 2N^2 per right-hand side, each with an O(N) sequential factorization depth. These apply to the library node directly (e.g. when auto-optimization keeps it), complementing the existing matmul/gemm/gemv/dot counters.
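The leading-order counts quoted above, written out as plain functions for a single right-hand side. The function names are hypothetical; only the formulas come from the commit message:

```python
from fractions import Fraction


def cholesky_flops(n):
    """Leading-order Cholesky cost, ~N^3/3."""
    return Fraction(n**3, 3)


def inv_flops(n):
    """Leading-order inversion cost, ~2N^3."""
    return 2 * n**3


def solve_flops(n):
    """Leading-order solve cost for one right-hand side, ~2N^3/3 + 2N^2."""
    return Fraction(2 * n**3, 3) + 2 * n**2


def factorization_depth(n):
    """O(N) sequential depth; the constant factor is taken as 1 here."""
    return n
```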

…nels Add polybench_analysis_test, which reuses the canonical tests/polybench kernels (no duplicate definitions) and pins the compute work and (unoptimized) memory volume of all 30, with the optimized memory path exercised for robustness. Also covers the linalg solver library-node counters, the compute-vs-address classification of interstate-edge arithmetic, the loop-carried integer-compute case, and the data-dependent sparse matrix-vector known limitation.

Resolve Max/Min nodes in the per-iteration volume when the summation bound fixes the comparison sign (e.g. a triangular nest where the inner index is bounded by the outer one), so cholesky/lu/ludcmp produce a closed-form read volume instead of an unevaluated sum. SymPy's refine cannot do this, and the same-named symbols carry inconsistent assumptions, so the resolver compares in a canonical namespace. Create symbols via dace.symbolic.symbol rather than sympy.Symbol throughout, classify interstate edges with isinstance instead of getattr, and pin the now-closed cholesky/lu/ludcmp volumes in the test.

…p-cost test Spell out in the total_volume module docstring and the polybench test where the bytes-moved and flop figures come from: total_volume counts each accessed region once per enclosing map nest and multiplies by enclosing sequential-loop trip counts (reuse within a parallel nest, flush across loops -- not the I/O-optimal IOLB lower bound), and the flop count is work-depth's realised operation count (cross-referenced with PolyBench/Python CC'21 and polybench_set_program_flops). Add a test showing a user can override a function's flop cost (sin 1 -> 65).
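As a rough illustration of the stated cost model (one count of the accessed region per map nest, multiplied by the trip counts of enclosing sequential loops), under the assumption that the region size is already known as an element count. `modeled_bytes` is a hypothetical name:

```python
from math import prod


def modeled_bytes(region_elements, element_bytes, enclosing_loop_trips=()):
    """Region counted once per map nest, flushed on every sequential-loop trip."""
    return region_elements * element_bytes * prod(enclosing_loop_trips)


# A map reading 100 doubles, once on its own and once inside a 10-step time loop.
single_pass = modeled_bytes(100, 8)
inside_time_loop = modeled_bytes(100, 8, enclosing_loop_trips=(10,))
```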

…dd docstrings Replace the manual 'next(e for e in state.in_edges(node) if e.dst_conn==...)' connector lookups in the work-depth library-node counters with the canonical SDFGState.in_edges_by_connector / out_edges_by_connector API. Derive total_volume.get_static_symbols' type-cast list from dace.dtypes (matched longest-first) instead of hard-coding it. Add concise docstrings to the matmul/reduce/dot/cholesky/inv/solve counters.

The control-flow-region and inlined forms produce identical work, differing only in the auto-generated loop-execution-count symbol name (num_execs_<cfg>_<node>), which is structure-dependent. Canonicalize that symbol in the standardize comparison helper so the two forms compare equal, drop the now-unnecessary skips on the *_inlined variants, and remove a stray sdfg.save debug write.
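One way such a canonicalization could work on the string form of an expression. The regular expression assumes `<cfg>` and `<node>` are integer ids, which the commit message does not state, and `canonicalize_num_execs` is a hypothetical name:

```python
import re

# Assumption: the generated names look like num_execs_<int>_<int>.
_NUM_EXECS = re.compile(r"\bnum_execs_\d+_\d+\b")


def canonicalize_num_execs(expr: str) -> str:
    """Rename each distinct num_execs symbol by order of first appearance."""
    seen = {}

    def rename(match):
        return seen.setdefault(match.group(0), f"num_execs_{len(seen)}")

    return _NUM_EXECS.sub(rename, expr)


structured = canonicalize_num_execs("N*num_execs_0_3 + num_execs_0_3")
inlined = canonicalize_num_execs("N*num_execs_1_7 + num_execs_1_7")
```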

Replace diagnostic print() calls in the work-depth and operational-intensity analyses with warnings.warn (unrecognized tasklet function, loop fallback, analysis-failed, only-map-scopes, no-merge-state, uncommon iedge assignment, high MAPE), remove debug prints (the legacy-loop trace and the interactive read/write-set dump), and drop the dead AccessStack.debug_print method. CLI main() result output and the interactive ask_user prompts are kept.
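The print-to-warning change follows the usual standard-library pattern, sketched here with a hypothetical `tasklet_function_work` helper and cost table. A warning can be captured and asserted on in tests, which a print cannot:

```python
import warnings


def tasklet_function_work(name, known_costs):
    """Return the cost of a function call, warning when it is unrecognized."""
    if name not in known_costs:
        warnings.warn(f"Unrecognized tasklet function '{name}'; counting zero work.")
        return 0
    return known_costs[name]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unknown_work = tasklet_function_work("mystery", {"sin": 1})
    known_work = tasklet_function_work("sin", {"sin": 1})
```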

Replace the traceback.print_exc debug dump in the operational-intensity cache-miss helper with warnings.warn (and drop the now-unused traceback import), narrow a bare except in helpers.get_static_symbols, move the cfg import in operational_intensity to module scope, give the lazy matplotlib import in op_in_helpers.plot a reason comment, and remove a dead commented assignment in work_depth.

Two bugs in the ConditionalBlock handling of cfr_misses, which implements the interactive ask_user branch picker:
- Picking the implicit-else option crashed: a stray reassignment replaced the real else-branch region object with the literal string 'else_branch', which has no start_block. Removed it; the surviving assignment already carries the real region and the prompt already renders "else_branch".
- A branch chosen once was not reused on later visits of the same conditional. The documented behavior (reuse the prior decision) was unimplemented, so a conditional inside a loop fell through to the worst-case path that explores all branches and takes the max every iteration. Added the decided-branch reuse case.
Adds two tests: branch selection honors the user's choice (different-work branches yield different intensities), and a conditional inside a loop prompts once and applies the chosen branch on every iteration.
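The decided-branch reuse can be pictured as a small cache keyed on the conditional. This is a sketch under the assumption that each conditional has a stable identifier; `make_branch_picker` and the `ask_user` callback signature are hypothetical:

```python
def make_branch_picker(ask_user):
    """Wrap a prompt so each conditional is asked about at most once."""
    decided = {}

    def pick(conditional_id, branch_names):
        if conditional_id not in decided:
            decided[conditional_id] = ask_user(conditional_id, branch_names)
        return decided[conditional_id]

    return pick


prompts = []


def fake_prompt(conditional_id, branch_names):
    prompts.append(conditional_id)
    return branch_names[-1]  # always choose the last option, here the else branch


pick = make_branch_picker(fake_prompt)
# A conditional inside a loop is visited once per iteration.
choices = [pick("cond_0", ["if_branch", "else_branch"]) for _ in range(5)]
```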

The combined PR introduced two copies of get_static_symbols and subs_till_fixed_point: an older one in helpers.py (used by work_depth and operational_intensity) with a hard-coded type-name list and raw sympy, and an improved one in total_volume.py deriving the type names from dace.dtypes and using dace.symbolic. Keep a single improved implementation in helpers.py (the shared module) and import it from total_volume. get_static_symbols keeps returning string-keyed mappings, which both substitution (expr.subs) and by-name indexing (mapping[loop_var] in operational_intensity) rely on, so all three analyses now share the better type-cast handling with no change to their contracts.

Cover the simulation-based path of analyze_sdfg_op_in: giving a symbol a range 'start,stop,step' samples it, simulates cache misses for each sample, and fits the operational intensity as a function of that symbol. For the streaming single_map64 kernel the fitted intensity is the constant 1/24, which the test checks at several sizes.

When control flow is inlined to plain conditional interstate edges (no ConditionalBlock), an undecidable data-dependent branch is handled by the legacy traversal in cfg_misses. The non-ask_user path was broken: it referenced an uninitialized curr_misses (crashing with UnboundLocalError) and computed a final_misses/final_e that were discarded, so it never actually accounted the chosen branch. Rewrite it to mirror the ConditionalBlock handling: analyze every candidate branch on copies of the mapping and stack, then continue along the worst-case (most misses) one, committing its state before resuming at the merge state. Adds a test that inlines a branchy program and checks the legacy traversal yields the same operational intensity as its ConditionalBlock form.

Make the operational-intensity test docstrings concise and trim the worst-case branch comment, with no change to behavior.

Replace the hand-rolled BFS in find_states_between with a light wrapper around the existing dace.sdfg.utils.nodes_in_all_simple_paths utility, and drop the now unused deque import. find_merge_state already used cfg.branch_merges; correct its stale docstring to describe that instead of an unrelated function.

Remove the matmul/gemm/gemv/dot connector lookups whose results were never used, leaving only the memlets each counter actually reads.

The operational intensity is collected into op_in_map by side effect, so the returned miss count was unused. Call cfg_misses without binding it.

The analyses now require structured control flow (loops as LoopRegion, branches as ConditionalBlock) and do not model early loop exits. This removes a large amount of legacy machinery:
- helpers.py: drop the legacy loop-detection cluster (get_domtree, get_backedges, NodeCycle, LoopExtractionError, find_loop_guards_tails_exits, get_legacy_loop_body, get_legacy_loop_ranges).
- work_depth.py: drop the "legacy loops" section that summed and broke cycles.
- operational_intensity.py: drop the legacy interstate-branch handling (find_merge_state, find_states_between, mem_accesses_on_path and the candidate/worst-case block); the state walk now follows the single structured successor.
In their place, a shared helpers.has_unstructured_control_flow detects a legacy loop (a cycle outside a LoopRegion), unstructured branching (a block with more than one outgoing edge), or break/continue (BreakBlock / ContinueBlock). Each of the three analyses checks it up front and, if unstructured, warns and returns a zero result rather than a wrong one.
Tests: replace the inlined work-depth/op-intensity tests (which exercised the removed legacy paths) with tests asserting the warning and zero result on inlined SDFGs and on break/continue kernels; add the same for memory volume.

return is a non-local exit the analyses do not model, so treat ReturnBlock like break/continue in has_unstructured_control_flow. With break/continue/return all caught by the up-front bail, the (ReturnBlock, ContinueBlock, BreakBlock) -> zero-cost branches in work-depth, operational-intensity and memory-volume are unreachable; remove them and the now-unused imports. Extend the work-depth bail test to cover an early-return kernel.
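A simplified picture of the up-front check on a plain successor graph. This is not the DaCe implementation: blocks are strings, `successors` and `block_kinds` are hypothetical inputs, and the nesting of regions is ignored:

```python
NON_LOCAL_EXITS = {"break", "continue", "return"}


def has_cycle(successors):
    """Depth-first search for a back edge in a block -> successors mapping."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in successors.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(successors))


def has_unstructured_control_flow(successors, block_kinds=None):
    """True for a cycle, a block with several successors, or a non-local exit."""
    kinds = block_kinds or {}
    if any(kind in NON_LOCAL_EXITS for kind in kinds.values()):
        return True
    if any(len(succ) > 1 for succ in successors.values()):
        return True
    return has_cycle(successors)


chain = {"a": ["b"], "b": ["c"], "c": []}
legacy_loop = {"guard": ["body"], "body": ["guard"]}
branching = {"a": ["b", "c"], "b": [], "c": []}
```

An analysis would call this first and, on True, warn and return its zero result.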

…tract Fix the stale control_flow_region_work_depth docstring (it no longer breaks loops; structured regions are already DAGs) and a ControlFlowRegion typo, and note on each public entry point (work-depth, operational-intensity, memory-volume) that only structured control flow is supported and unstructured input yields a warned zero result.

…h import, no banners, test reuse)
- op_in_helpers: drop the unnecessary `from __future__ import annotations`; quote the one self-referential annotation instead.
- tests/polybench/{2mm,3mm,floyd-warshall,gemm,jacobi-2d}: replace the try/except import-polybench fallback with a plain `import polybench`.
- polybench_analysis_test: replace dashed banner comments with plain comments.
- total_volume_test: factor the duplicated LoopRegion construction into make_loop_volume_sdfg.

… int_floor Replace the production sp.Symbol creation sites (loop var, num_execs, library-node work/depth/misses placeholders) with dace.symbolic.symbol, and the parsed array-index symbols with pystr_to_symbolic (which handles compound indices and emits dace symbols). Use dace.symbolic.int_floor for the symbolic loop-range divisions in accumulate_over_range and scope_volume. An unbounded loop may have no loop variable, so guard symbol creation (None then forces the execution-count fallback).

… test normalizers Convert the symbol-creation sites in the assumption engine (assumptions.py) and the assumption-test fixtures to dace.symbolic.symbol, so symbols are dace symbols everywhere. Fix the equal-symbol self-reference check to compare by name (DaCe symbols are not interned, so the old `is` identity check no longer held). With symbols now consistent, remove the test-side assumption-stripping (standardize / reps) and rebuild the auxiliary expected symbols (num_execs, _p_i, Reduce_misses) as the exact dace symbols the analyses emit. Also use `import dace` rather than `import dace as dc`.
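The identity-versus-name point can be shown with a toy class. `Sym` is a hypothetical stand-in for a symbol type whose instances are not interned; it is not the DaCe symbol class:

```python
class Sym:
    """Toy symbol: two instances with the same name are distinct objects."""

    def __init__(self, name):
        self.name = name


def is_same_symbol(lhs, rhs):
    """Compare by name, which survives non-interned construction."""
    return lhs.name == rhs.name


first = Sym("N")
second = Sym("N")
identity_check = first is second          # False: separate objects
name_check = is_same_symbol(first, second)  # True: same name
```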

Use dace.symbolic.pystr_to_symbolic for the expected-value conversions in the comparison helpers, and compare the bail results against plain (0, 0).

…) instead of raw sympy Replace sp.sympify with dace.symbolic.pystr_to_symbolic and sp.simplify with dace.symbolic.simplify across the analyses and tests. The remaining sympy uses (Max/Min/log/Sum/Abs/Piecewise/oo/N and the isinstance(x, sp.Symbol/sp.Basic) checks and sp.Expr type hints) have no dace.symbolic equivalent and stay. Standardize the dace.symbolic imports to `from dace.symbolic import <names>` across all changed files (renaming the colliding local `symbol` variable in assumptions.parse_assumptions to `lhs`).

operational_intensity: factor the identical in-/out-edge cache-miss accounting in scope_misses into a single _edge_miss helper, and collapse update_map_iterators' two near-identical SymExpr/non-SymExpr branches into one. total_volume: factor scope_volume's symmetric read/write handling into _access_volume (global-memory edge byte volume) and _accumulate_volume_over_var (sum a per-iteration volume over one loop/map var).

AccessStack.copy / AccessStack.in_cache_as_list / CacheLineTracker.copy had no callers left after the unstructured-control-flow branch exploration was removed (the conditional handler deep-copies the stack instead). Drop them and the now-unused deque import.

…e polybench total_volume: resolve_minmax_over_range unpacked the canonicalized Max/Min assuming two arguments, but canonicalization can collapse it (e.g. Max(j, j-1) -> j); guard the unpack and map a collapsed node to its surviving operand. The resulting exception was swallowed by cfr_volume's loop fallback while a range-var-stack frame was already pushed, leaking the frame and multiplying a sibling loop nest's volume by a spurious range -- this turned lu/ludcmp read volume into a spurious O(N**5) (and write into O(N**3)). Pop the frame in finally so a failure in the recursion cannot leak it. lu/ludcmp now report the O(N**4) bounding-box read and O(N**2) write. polybench_analysis_test: pin each kernel's work/read/write as a (symbolic, value-at-_SIZES) pair side-by-side and assert the analysis matches the closed form symbolically (not only at one size); re-pin lu/ludcmp to the corrected volumes.
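The frame leak and its fix follow a common push/try/finally pattern. The sketch below uses hypothetical names (`visit_loop`, `volume_with_fallback`) and a plain list as the range-var stack:

```python
def visit_loop(range_stack, frame, visit_body):
    """Push the loop's frame, recurse, and always pop it again."""
    range_stack.append(frame)
    try:
        return visit_body()
    finally:
        range_stack.pop()  # runs even when visit_body raises


def volume_with_fallback(range_stack, frame, visit_body, fallback):
    """Caller-side fallback that swallows a failure in the recursion."""
    try:
        return visit_loop(range_stack, frame, visit_body)
    except Exception:
        return fallback


def failing_body():
    raise ValueError("could not unpack collapsed Max/Min")


stack = []
result = volume_with_fallback(stack, ("i", 0, 10), failing_body, fallback=0)
```

Without the `finally`, the swallowed exception would leave `("i", 0, 10)` on the stack and a sibling loop nest would be scaled by it.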

…l(x, -1) total_volume: estimate a map's accessed region as the tighter (Min) of the propagated boundary memlet (a bounding box, which over-counts the gap between disjoint slices such as a row and a column of a triangular access) and the sum of the per-connector footprints (each inner edge propagated over the map on its own, which over-counts overlapping slices such as a stencil's neighbourhood). Both are valid upper bounds on the working set, so their Min is a tighter valid bound -- and exact for the polybench corpus. This pulls the triangular reads (cholesky/lu/ludcmp) down from the bounding box's O(N**4) to the true O(N**3), while stencils keep their tight bounding box. resolve_minmax_over_range now divides out a positive common factor before the affine test, so Min(2*j, j*(i - j + 1)) reduces to comparing 2 vs i - j + 1; the accumulation iterates the resolution to a fixed point so nested Max/Min collapse inside-out. A residual Max/Min that no loop range fixes (e.g. a stencil's box vs its per-connector sum) is resolved by dominance at representative large sizes -- sound because either operand is a valid upper bound. symbolic: int_floor/int_ceil(x, -1) now fold to -x (parallel to the existing y == 1 rule and exact under the same integer assumption). A descending loop (step -1) writes its trip count as (1 - N) // (-1); this folds it to the clean N - 1 at construction, so the analyses report e.g. adi work 38*tsteps*(N-2)**2+40 and ludcmp work N*(N**2+2*N-2) instead of int_floor(1 - N, -1) forms. Re-pin the affected polybench expected values (concrete values unchanged except the cholesky/lu/ludcmp O(N**3) reads).
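A small brute-force illustration of the two upper bounds and their Min, for one triangular row-plus-column access and one 1-D stencil. As a simplification, footprints are exact sets of (row, column) indices rather than propagated memlets, and all names are hypothetical:

```python
def bounding_box(footprints):
    """Element count of the box enclosing the union of all footprints."""
    points = set().union(*footprints)
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    return (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)


def per_connector_sum(footprints):
    """Sum of each footprint's size, counting overlaps repeatedly."""
    return sum(len(f) for f in footprints)


def estimate(footprints):
    return min(bounding_box(footprints), per_connector_sum(footprints))


def working_set(footprints):
    return len(set().union(*footprints))


i = 5  # triangular access: row A[i, 0:i] and column A[0:i, i]
triangular = [{(i, j) for j in range(i)}, {(k, i) for k in range(i)}]

n = 10  # stencil: A[k-1], A[k], A[k+1] for k = 1..n-2
stencil = [{(0, k + d) for k in range(1, n - 1)} for d in (-1, 0, 1)]
```

The box over-counts the triangular case (36 versus 10) and the per-connector sum over-counts the stencil (24 versus 10); the Min matches the true working set in both.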

…name _edge_access_volume gathered the per-connector footprints by matching the array name, so for one access-node edge it summed every inner edge of that array -- including inner edges fed by a different connector (a second access node of the same array), counting the footprint once per connector. Associate each edge with the inner edges of its own connector via the IN_x <-> OUT_x pairing instead, and fall back to the bounding box for any connector outside that convention. Behavior is unchanged on the polybench corpus (each array enters its map through a single connector); the brute-force working-set ground truth still matches.
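The connector pairing can be sketched on plain tuples. Inner edges are modelled as hypothetical `(source_connector, array_name, footprint)` triples and the helper names are illustrative:

```python
def paired_out_connector(in_connector):
    """Map an entry in-connector IN_x to its inner out-connector OUT_x."""
    if in_connector is not None and in_connector.startswith("IN_"):
        return "OUT_" + in_connector[len("IN_"):]
    return None  # outside the convention: the caller uses the bounding box


def per_connector_footprint(in_connector, inner_edges):
    """Sum the footprints of the inner edges fed by this connector only."""
    out_connector = paired_out_connector(in_connector)
    if out_connector is None:
        return None
    return sum(size for conn, _array, size in inner_edges if conn == out_connector)


# Array A enters the map through two connectors (two access nodes of A).
inner_edges = [("OUT_A", "A", 4), ("OUT_A", "A", 6), ("OUT_A_1", "A", 5)]

# Old behaviour: matching by array name sums every inner edge of A.
by_name = sum(size for _conn, array, size in inner_edges if array == "A")
```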

PYFUNC_TO_ARITHMETICS recognized only a subset of the common math intrinsics (exp, tanh, sin, cos, sqrt, atan2); tan, asin, acos, atan, sinh, cosh, log, log2, log10, exp2 and cbrt were unrecognized, so a tasklet using them had its transcendental work silently counted as zero (with an "unrecognized function" warning). Add them at one realised operation each, matching the existing op-count convention (np.* and math.* both lower to the bare C name). A test asserts each is now counted rather than dropped.
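A rough stand-in for the lookup, using the standard `ast` module on a tasklet-like expression. `FUNCTION_COST` and `count_function_work` are hypothetical names, not the PR's table or API; the function list and the one-operation cost are taken from the message above:

```python
import ast

FUNCTION_COST = {name: 1 for name in (
    "exp", "tanh", "sin", "cos", "sqrt", "atan2",      # previously recognized
    "tan", "asin", "acos", "atan", "sinh", "cosh",     # newly added
    "log", "log2", "log10", "exp2", "cbrt",
)}


def count_function_work(code, costs=FUNCTION_COST):
    """Return (work, unrecognized names) for the calls found in the code."""
    work, unrecognized = 0, []
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # np.tan and math.tan both reduce to the bare name "tan".
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name in costs:
            work += costs[name]
        else:
            unrecognized.append(name)
    return work, unrecognized


counted = count_function_work("out = np.tan(a) + math.log2(b) + cbrt(c)")
dropped = count_function_work("out = erf(a)")
```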

resolve_size_dominated_minmax: replace the explicit values/index/clear loop with min/max over the operands keyed on the probe value, collected into a set so a single-element set means all sample sizes agreed. Fold the gcd-factor guard in resolve_minmax_over_range into one try (drop the separate Integer(1) fallback). Behavior-preserving: polybench + total_volume tests and the brute-force working-set ground truth are unchanged.
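The set-based agreement check might look like the following, with operands modelled as plain callables of one size parameter. The signature and the sample sizes are assumptions for illustration:

```python
def resolve_size_dominated(kind, operands, sample_sizes):
    """Return the operand that wins at every sample size, or None if they disagree."""
    pick = min if kind == "Min" else max
    winners = {
        pick(range(len(operands)), key=lambda k: operands[k](n))
        for n in sample_sizes
    }
    return operands[winners.pop()] if len(winners) == 1 else None


box = lambda n: n                    # stencil bounding box
per_connector = lambda n: 3 * (n - 2)  # three overlapping neighbours
constant = lambda n: 100

agreed = resolve_size_dominated("Min", [box, per_connector], (64, 128, 256))
disagreed = resolve_size_dominated("Min", [box, constant], (10, 1000))
```

A single-element `winners` set means every probe picked the same operand; otherwise the node is left unresolved.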

The pinned EXPECTED values are self-referential (they pin what the analysis emits). Add a test that independently re-enumerates the distinct global-memory elements each kernel touches -- mirroring the kernel source under the cost model, without using the analysis or the pinned values -- and asserts the analyzed read/write bytes equal that true working-set traffic. Covers both regimes of the bounding-box-vs-per-connector Min: triangular row+column / row+row accesses (cholesky/lu/trisolv) and an overlapping stencil neighbourhood (jacobi-1d), at two sizes each (including sizes outside the pinned set).

ThrudPrimrose added 30 commits May 25, 2026 20:02

perf-analysis: drop accidental unittest import, narrow bare except

62a07db

Remove the stray 'from unittest import result' in work_depth_test and replace a bare 'except:' with 'except Exception' in the operational-intensity interstate-edge handling.

work-depth: handle symbolic map step in accumulate_over_range

c36697f

A symbolic step (e.g. a tile size) cannot be compared with > 0; decide the iteration direction from step.is_negative instead, treating an unknown-sign step as forward (map steps are never negative).

Apply pre-commit formatting (yapf, end-of-file, trailing whitespace)

1aef3e7

perf-analysis: tighten test docstrings and a comment

cd822be

Make the operational-intensity test docstrings concise and trim the worst-case branch comment, with no change to behavior.

work-depth: drop unused memlet locals in the BLAS counters

4ebd068

Remove the matmul/gemm/gemv/dot connector lookups whose results were never used, leaving only the memlets each counter actually reads.

perf-analysis: drop unused result of the concrete-path cfg_misses call

a399cf9

The operational intensity is collected into op_in_map by side effect, so the returned miss count was unused. Call cfg_misses without binding it.

ThrudPrimrose added 14 commits May 26, 2026 14:42

perf-analysis tests: convert via pystr_to_symbolic instead of sp.sympify

5706989

Use dace.symbolic.pystr_to_symbolic for the expected-value conversions in the comparison helpers, and compare the bail results against plain (0, 0).

ThrudPrimrose wants to merge 44 commits into main from modernize-perf-analysis-and-memory-volume
