[WIP] Modernize perf analysis and memory volume#2387
Draft
ThrudPrimrose wants to merge 44 commits into
Draft
Conversation
Move the 'import polybench' (and the absl-backed entry point) into each kernel's '__main__' block so the modules can be imported as plain kernel libraries for analysis/tests without requiring absl-py. Behavior when run as a script is unchanged.
…ybench Replace the TestAnalyzeSdfg class with module-level test_* functions, add the copyright header and Sphinx docstrings, trim unused imports, and remove the duplicated test_jacobi_1d (defined twice) and the inline jacobi kernels: per-kernel volume checks on real kernels move to polybench_analysis_test.
Remove a stray sdfg.save('sdfg.sdfg') debug write, fix main() unpacking a
4-tuple from analyze_sdfg's 2-tuple return, correct invalid type-hint
syntax, replace bare 'except:' with 'except Exception', avoid mutable
default arguments, de-alias imports, and add the copyright header, module
docstring and Sphinx docstrings.
Remove the stray 'from unittest import result' in work_depth_test and replace a bare 'except:' with 'except Exception' in the operational- intensity interstate-edge handling.
Replace the incomplete hard-coded zero-work cast list with one derived from the available DaCe data types plus Python/C builtins (so casts like double, dace.uint16 no longer warn), and treat int_floor/int_ceil as one integer division each.
Add an optimize parameter (default True) so callers can request the raw data volume without auto-optimization (which can inject vectorization-tiling artifacts). Route summations through safe_summation, which strips the positive assumptions that make SymPy's posify raise on DaCe size symbols, so triangular-loop kernels (cholesky/lu/ludcmp) no longer crash.
Map scopes multiplied body work by the iteration count, which is only correct when the work is independent of the iteration variable; for triangular nests (an inner map bound by an outer parameter, e.g. syrk, symm, trmm, correlation, covariance) this left the outer index free in the result. Introduce a shared accumulate_over_range helper that summates work (and depth, for loops) over the iteration domain -- reducing to a multiplication when independent -- and use it from both the loop and map handlers so they accumulate identically.
A symbolic step (e.g. a tile size) cannot be compared with > 0; decide the iteration direction from step.is_negative instead, treating an unknown-sign step as forward (map steps are never negative).
A symbol read location may be any ControlFlowBlock (an SDFGState, or a loop or conditional region reading the symbol in its condition/bounds), not only an SDFGState or interstate edge. The scope-coarsening check assumed the non-state case was always an edge and dereferenced .src, raising on a LoopRegion. Reduce any control-flow block to itself and only edges to their source block before testing reachability.
Interstate-edge assignments mix value computation (e.g. the bitwise updates to a CRC kernel's loop-carried scalars) with pure addressing (index/bound helpers). Add compute_symbols, which classifies a symbol as compute when it is read inside a tasklet or transitively feeds such a symbol, and count an edge assignment's arithmetic toward work only for compute symbols, so addressing arithmetic no longer inflates the work estimate.
Add leading-order flop/depth counters for the linalg solver library nodes: Cholesky ~N^3/3, Inv ~2N^3, Solve ~2N^3/3 + 2N^2 per right-hand side, each with an O(N) sequential factorization depth. These apply to the library node directly (e.g. when auto-optimization keeps it), complementing the existing matmul/gemm/gemv/dot counters.
…nels Add polybench_analysis_test, which reuses the canonical tests/polybench kernels (no duplicate definitions) and pins the compute work and (un- optimized) memory volume of all 30, with the optimized memory path exercised for robustness. Also covers the linalg solver library-node counters, the compute-vs-address classification of interstate-edge arithmetic, the loop-carried integer-compute case, and the data-dependent sparse matrix-vector known limitation.
Resolve Max/Min nodes in the per-iteration volume when the summation bound fixes the comparison sign (e.g. a triangular nest where the inner index is bounded by the outer one), so cholesky/lu/ludcmp produce a closed-form read volume instead of an unevaluated sum. SymPy's refine cannot do this, and the same-named symbols carry inconsistent assumptions, so the resolver compares in a canonical namespace. Create symbols via dace.symbolic.symbol rather than sympy.Symbol throughout, classify interstate edges with isinstance instead of getattr, and pin the now-closed cholesky/lu/ludcmp volumes in the test.
…p-cost test Spell out in the total_volume module docstring and the polybench test where the bytes-moved and flop figures come from: total_volume counts each accessed region once per enclosing map nest and multiplies by enclosing sequential-loop trip counts (reuse within a parallel nest, flush across loops -- not the I/O-optimal IOLB lower bound), and the flop count is work-depth's realised operation count (cross-referenced with PolyBench/Python CC'21 and polybench_set_program_flops). Add a test showing a user can override a function's flop cost (sin 1 -> 65).
…dd docstrings Replace the manual 'next(e for e in state.in_edges(node) if e.dst_conn==...)' connector lookups in the work-depth library-node counters with the canonical SDFGState.in_edges_by_connector / out_edges_by_connector API. Derive total_volume.get_static_symbols' type-cast list from dace.dtypes (matched longest-first) instead of hard-coding it. Add concise docstrings to the matmul/reduce/dot/cholesky/inv/solve counters.
The control-flow-region and inlined forms produce identical work, differing only in the auto-generated loop-execution-count symbol name (num_execs_<cfg>_<node>), which is structure-dependent. Canonicalize that symbol in the standardize comparison helper so the two forms compare equal, drop the now-unnecessary skips on the *_inlined variants, and remove a stray sdfg.save debug write.
Replace diagnostic print() calls in the work-depth and operational-intensity analyses with warnings.warn (unrecognized tasklet function, loop fallback, analysis-failed, only-map-scopes, no-merge-state, uncommon iedge assignment, high MAPE), remove debug prints (the legacy-loop trace and the interactive read/write-set dump), and drop the dead AccessStack.debug_print method. CLI main() result output and the interactive ask_user prompts are kept.
Replace the traceback.print_exc debug dump in the operational-intensity cache-miss helper with warnings.warn (and drop the now-unused traceback import), narrow a bare except in helpers.get_static_symbols, move the cfg import in operational_intensity to module scope, give the lazy matplotlib import in op_in_helpers.plot a reason comment, and remove a dead commented assignment in work_depth.
Two bugs in the ConditionalBlock handling of cfr_misses, which implements the interactive ask_user branch picker: - Picking the implicit-else option crashed: a stray reassignment replaced the real else-branch region object with the literal string 'else_branch', which has no start_block. Removed it; the surviving assignment already carries the real region and the prompt already renders "else_branch". - A branch chosen once was not reused on later visits of the same conditional. The documented behavior (reuse the prior decision) was unimplemented, so a conditional inside a loop fell through to the worst-case path that explores all branches and takes the max every iteration. Added the decided-branch reuse case. Adds two tests: branch selection honors the user's choice (different-work branches yield different intensities), and a conditional inside a loop prompts once and applies the chosen branch on every iteration.
The combined PR introduced two copies of get_static_symbols and subs_till_fixed_point: an older one in helpers.py (used by work_depth and operational_intensity) with a hard-coded type-name list and raw sympy, and an improved one in total_volume.py deriving the type names from dace.dtypes and using dace.symbolic. Keep a single improved implementation in helpers.py (the shared module) and import it from total_volume. get_static_symbols keeps returning string-keyed mappings, which both substitution (expr.subs) and by-name indexing (mapping[loop_var] in operational_intensity) rely on, so all three analyses now share the better type-cast handling with no change to their contracts.
Cover the simulation-based path of analyze_sdfg_op_in: giving a symbol a range 'start,stop,step' samples it, simulates cache misses for each sample, and fits the operational intensity as a function of that symbol. For the streaming single_map64 kernel the fitted intensity is the constant 1/24, which the test checks at several sizes.
When control flow is inlined to plain conditional interstate edges (no ConditionalBlock), an undecidable data-dependent branch is handled by the legacy traversal in cfg_misses. The non-ask_user path was broken: it referenced an uninitialized curr_misses (crashing with UnboundLocalError) and computed a final_misses/final_e that were discarded, so it never actually accounted the chosen branch. Rewrite it to mirror the ConditionalBlock handling: analyze every candidate branch on copies of the mapping and stack, then continue along the worst-case (most misses) one, committing its state before resuming at the merge state. Adds a test that inlines a branchy program and checks the legacy traversal yields the same operational intensity as its ConditionalBlock form.
Make the operational-intensity test docstrings concise and trim the worst-case branch comment, with no change to behavior.
Replace the hand-rolled BFS in find_states_between with a light wrapper around the existing dace.sdfg.utils.nodes_in_all_simple_paths utility, and drop the now unused deque import. find_merge_state already used cfg.branch_merges; correct its stale docstring to describe that instead of an unrelated function.
Remove the matmul/gemm/gemv/dot connector lookups whose results were never used, leaving only the memlets each counter actually reads.
The operational intensity is collected into op_in_map by side effect, so the returned miss count was unused. Call cfg_misses without binding it.
The analyses now require structured control flow (loops as LoopRegion, branches as ConditionalBlock) and do not model early loop exits. This removes a large amount of legacy machinery: - helpers.py: drop the legacy loop-detection cluster (get_domtree, get_backedges, NodeCycle, LoopExtractionError, find_loop_guards_tails_exits, get_legacy_loop_body, get_legacy_loop_ranges). - work_depth.py: drop the "legacy loops" section that summed and broke cycles. - operational_intensity.py: drop the legacy interstate-branch handling (find_merge_state, find_states_between, mem_accesses_on_path and the candidate/worst-case block); the state walk now follows the single structured successor. In their place, a shared helpers.has_unstructured_control_flow detects a legacy loop (a cycle outside a LoopRegion), unstructured branching (a block with more than one outgoing edge), or break/continue (BreakBlock / ContinueBlock). Each of the three analyses checks it up front and, if unstructured, warns and returns a zero result rather than a wrong one. Tests: replace the inlined work-depth/op-intensity tests (which exercised the removed legacy paths) with tests asserting the warning and zero result on inlined SDFGs and on break/continue kernels; add the same for memory volume.
return is a non-local exit the analyses do not model, so treat ReturnBlock like break/continue in has_unstructured_control_flow. With break/continue/return all caught by the up-front bail, the (ReturnBlock, ContinueBlock, BreakBlock) -> zero-cost branches in work-depth, operational-intensity and memory-volume are unreachable; remove them and the now-unused imports. Extend the work-depth bail test to cover an early-return kernel.
…tract Fix the stale control_flow_region_work_depth docstring (it no longer breaks loops; structured regions are already DAGs) and a ControlFlowRegion typo, and note on each public entry point (work-depth, operational-intensity, memory-volume) that only structured control flow is supported and unstructured input yields a warned zero result.
…h import, no banners, test reuse)
- op_in_helpers: drop the unnecessary `from __future__ import annotations`; quote the one
self-referential annotation instead.
- tests/polybench/{2mm,3mm,floyd-warshall,gemm,jacobi-2d}: replace the try/except import-polybench
fallback with a plain `import polybench`.
- polybench_analysis_test: replace dashed banner comments with plain comments.
- total_volume_test: factor the duplicated LoopRegion construction into make_loop_volume_sdfg.
… int_floor Replace the production sp.Symbol creation sites (loop var, num_execs, library-node work/depth/misses placeholders) with dace.symbolic.symbol, and the parsed array-index symbols with pystr_to_symbolic (which handles compound indices and emits dace symbols). Use dace.symbolic.int_floor for the symbolic loop-range divisions in accumulate_over_range and scope_volume. An unbounded loop may have no loop variable, so guard symbol creation (None then forces the execution-count fallback).
… test normalizers Convert the symbol-creation sites in the assumption engine (assumptions.py) and the assumption-test fixtures to dace.symbolic.symbol, so symbols are dace symbols everywhere. Fix the equal-symbol self-reference check to compare by name (DaCe symbols are not interned, so the old `is` identity check no longer held). With symbols now consistent, remove the test-side assumption-stripping (standardize / reps) and rebuild the auxiliary expected symbols (num_execs, _p_i, Reduce_misses) as the exact dace symbols the analyses emit. Also use `import dace` rather than `import dace as dc`.
Use dace.symbolic.pystr_to_symbolic for the expected-value conversions in the comparison helpers, and compare the bail results against plain (0, 0).
…) instead of raw sympy Replace sp.sympify with dace.symbolic.pystr_to_symbolic and sp.simplify with dace.symbolic.simplify across the analyses and tests. The remaining sympy uses (Max/Min/log/Sum/Abs/Piecewise/oo/N and the isinstance(x, sp.Symbol/sp.Basic) checks and sp.Expr type hints) have no dace.symbolic equivalent and stay. Standardize the dace.symbolic imports to `from dace.symbolic import <names>` across all changed files (renaming the colliding local `symbol` variable in assumptions.parse_assumptions to `lhs`).
operational_intensity: factor the identical in-/out-edge cache-miss accounting in scope_misses into a single _edge_miss helper, and collapse update_map_iterators' two near-identical SymExpr/non-SymExpr branches into one. total_volume: factor scope_volume's symmetric read/write handling into _access_volume (global-memory edge byte volume) and _accumulate_volume_over_var (sum a per-iteration volume over one loop/map var).
AccessStack.copy / AccessStack.in_cache_as_list / CacheLineTracker.copy had no callers left after the unstructured-control-flow branch exploration was removed (the conditional handler deep-copies the stack instead). Drop them and the now-unused deque import.
…e polybench total_volume: resolve_minmax_over_range unpacked the canonicalized Max/Min assuming two arguments, but canonicalization can collapse it (e.g. Max(j, j-1) -> j); guard the unpack and map a collapsed node to its surviving operand. The resulting exception was swallowed by cfr_volume's loop fallback while a range-var-stack frame was already pushed, leaking the frame and multiplying a sibling loop nest's volume by a spurious range -- this turned lu/ludcmp read volume into a spurious O(N**5) (and write into O(N**3)). Pop the frame in finally so a failure in the recursion cannot leak it. lu/ludcmp now report the O(N**4) bounding-box read and O(N**2) write. polybench_analysis_test: pin each kernel's work/read/write as a (symbolic, value-at-_SIZES) pair side-by-side and assert the analysis matches the closed form symbolically (not only at one size); re-pin lu/ludcmp to the corrected volumes.
…l(x, -1) total_volume: estimate a map's accessed region as the tighter (Min) of the propagated boundary memlet (a bounding box, which over-counts the gap between disjoint slices such as a row and a column of a triangular access) and the sum of the per-connector footprints (each inner edge propagated over the map on its own, which over-counts overlapping slices such as a stencil's neighbourhood). Both are valid upper bounds on the working set, so their Min is a tighter valid bound -- and exact for the polybench corpus. This pulls the triangular reads (cholesky/lu/ludcmp) down from the bounding box's O(N**4) to the true O(N**3), while stencils keep their tight bounding box. resolve_minmax_over_range now divides out a positive common factor before the affine test, so Min(2*j, j*(i - j + 1)) reduces to comparing 2 vs i - j + 1; the accumulation iterates the resolution to a fixed point so nested Max/Min collapse inside-out. A residual Max/Min that no loop range fixes (e.g. a stencil's box vs its per-connector sum) is resolved by dominance at representative large sizes -- sound because either operand is a valid upper bound. symbolic: int_floor/int_ceil(x, -1) now fold to -x (parallel to the existing y == 1 rule and exact under the same integer assumption). A descending loop (step -1) writes its trip count as (1 - N) // (-1); this folds it to the clean N - 1 at construction, so the analyses report e.g. adi work 38*tsteps*(N-2)**2+40 and ludcmp work N*(N**2+2*N-2) instead of int_floor(1 - N, -1) forms. Re-pin the affected polybench expected values (concrete values unchanged except the cholesky/lu/ludcmp O(N**3) reads).
…name _edge_access_volume gathered the per-connector footprints by matching the array name, so for one access-node edge it summed every inner edge of that array -- including inner edges fed by a different connector (a second access node of the same array), counting the footprint once per connector. Associate each edge with the inner edges of its own connector via the IN_x <-> OUT_x pairing instead, and fall back to the bounding box for any connector outside that convention. Behavior is unchanged on the polybench corpus (each array enters its map through a single connector); the brute-force working-set ground truth still matches.
PYFUNC_TO_ARITHMETICS recognized only a subset of the common math intrinsics (exp, tanh, sin, cos, sqrt, atan2); tan, asin, acos, atan, sinh, cosh, log, log2, log10, exp2 and cbrt were unrecognized, so a tasklet using them had its transcendental work silently counted as zero (with an "unrecognized function" warning). Add them at one realised operation each, matching the existing op-count convention (np.* and math.* both lower to the bare C name). A test asserts each is now counted rather than dropped.
resolve_size_dominated_minmax: replace the explicit values/index/clear loop with min/max over the operands keyed on the probe value, collected into a set so a single-element set means all sample sizes agreed. Fold the gcd-factor guard in resolve_minmax_over_range into one try (drop the separate Integer(1) fallback). Behavior-preserving: polybench + total_volume tests and the brute-force working- set ground truth are unchanged.
The pinned EXPECTED values are self-referential (they pin what the analysis emits). Add a test that independently re-enumerates the distinct global-memory elements each kernel touches -- mirroring the kernel source under the cost model, without using the analysis or the pinned values -- and asserts the analyzed read/write bytes equal that true working-set traffic. Covers both regimes of the bounding-box-vs-per-connector Min: triangular row+column / row+row accesses (cholesky/lu/trisolv) and an overlapping stencil neighbourhood (jacobi-1d), at two sizes each (including sizes outside the pinned set).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Testing needs to be completed.