Skip to content

[WIP] Modernize perf analysis and memory volume#2387

Draft
ThrudPrimrose wants to merge 44 commits into
mainfrom
modernize-perf-analysis-and-memory-volume
Draft

[WIP] Modernize perf analysis and memory volume#2387
ThrudPrimrose wants to merge 44 commits into
mainfrom
modernize-perf-analysis-and-memory-volume

Conversation

@ThrudPrimrose

Copy link
Copy Markdown
Collaborator

Testing needs to be completed.

… memory-volume

Combines PR #2356 (work-depth + operational-intensity adapted to
ControlFlowRegions) with PR #2357 (symbolic memory-traffic analysis,
total_volume.py). Updates the work-depth Transpose import to its current
dace.libraries.linalg location (moved out of standard by #2317).
Move the 'import polybench' (and the absl-backed entry point) into each
kernel's '__main__' block so the modules can be imported as plain
kernel libraries for analysis/tests without requiring absl-py. Behavior
when run as a script is unchanged.
…ybench

Replace the TestAnalyzeSdfg class with module-level test_* functions, add
the copyright header and Sphinx docstrings, trim unused imports, and remove
the duplicated test_jacobi_1d (defined twice) and the inline jacobi kernels:
per-kernel volume checks on real kernels move to polybench_analysis_test.
Remove a stray sdfg.save('sdfg.sdfg') debug write, fix main() unpacking a
4-tuple from analyze_sdfg's 2-tuple return, correct invalid type-hint
syntax, replace bare 'except:' with 'except Exception', avoid mutable
default arguments, de-alias imports, and add the copyright header, module
docstring and Sphinx docstrings.
Remove the stray 'from unittest import result' in work_depth_test and
replace a bare 'except:' with 'except Exception' in the operational-
intensity interstate-edge handling.
Replace the incomplete hard-coded zero-work cast list with one derived from
the available DaCe data types plus Python/C builtins (so casts like double,
dace.uint16 no longer warn), and treat int_floor/int_ceil as one integer
division each.
Add an optimize parameter (default True) so callers can request the raw data
volume without auto-optimization (which can inject vectorization-tiling
artifacts). Route summations through safe_summation, which strips the
positive assumptions that make SymPy's posify raise on DaCe size symbols,
so triangular-loop kernels (cholesky/lu/ludcmp) no longer crash.
Map scopes multiplied body work by the iteration count, which is only
correct when the work is independent of the iteration variable; for
triangular nests (an inner map bound by an outer parameter, e.g. syrk,
symm, trmm, correlation, covariance) this left the outer index free in the
result. Introduce a shared accumulate_over_range helper that summates work
(and depth, for loops) over the iteration domain -- reducing to a
multiplication when independent -- and use it from both the loop and map
handlers so they accumulate identically.
A symbolic step (e.g. a tile size) cannot be compared with > 0; decide the
iteration direction from step.is_negative instead, treating an
unknown-sign step as forward (map steps are never negative).
A symbol read location may be any ControlFlowBlock (an SDFGState, or a loop
or conditional region reading the symbol in its condition/bounds), not only
an SDFGState or interstate edge. The scope-coarsening check assumed the
non-state case was always an edge and dereferenced .src, raising on a
LoopRegion. Reduce any control-flow block to itself and only edges to their
source block before testing reachability.
Interstate-edge assignments mix value computation (e.g. the bitwise updates
to a CRC kernel's loop-carried scalars) with pure addressing (index/bound
helpers). Add compute_symbols, which classifies a symbol as compute when it
is read inside a tasklet or transitively feeds such a symbol, and count an
edge assignment's arithmetic toward work only for compute symbols, so
addressing arithmetic no longer inflates the work estimate.
Add leading-order flop/depth counters for the linalg solver library nodes:
Cholesky ~N^3/3, Inv ~2N^3, Solve ~2N^3/3 + 2N^2 per right-hand side, each
with an O(N) sequential factorization depth. These apply to the library node
directly (e.g. when auto-optimization keeps it), complementing the existing
matmul/gemm/gemv/dot counters.
…nels

Add polybench_analysis_test, which reuses the canonical tests/polybench
kernels (no duplicate definitions) and pins the compute work and (un-
optimized) memory volume of all 30, with the optimized memory path
exercised for robustness. Also covers the linalg solver library-node
counters, the compute-vs-address classification of interstate-edge
arithmetic, the loop-carried integer-compute case, and the data-dependent
sparse matrix-vector known limitation.
Resolve Max/Min nodes in the per-iteration volume when the summation bound
fixes the comparison sign (e.g. a triangular nest where the inner index is
bounded by the outer one), so cholesky/lu/ludcmp produce a closed-form read
volume instead of an unevaluated sum. SymPy's refine cannot do this, and the
same-named symbols carry inconsistent assumptions, so the resolver compares
in a canonical namespace. Create symbols via dace.symbolic.symbol rather than
sympy.Symbol throughout, classify interstate edges with isinstance instead of
getattr, and pin the now-closed cholesky/lu/ludcmp volumes in the test.
…p-cost test

Spell out in the total_volume module docstring and the polybench test where the
bytes-moved and flop figures come from: total_volume counts each accessed region
once per enclosing map nest and multiplies by enclosing sequential-loop trip
counts (reuse within a parallel nest, flush across loops -- not the I/O-optimal
IOLB lower bound), and the flop count is work-depth's realised operation count
(cross-referenced with PolyBench/Python CC'21 and polybench_set_program_flops).
Add a test showing a user can override a function's flop cost (sin 1 -> 65).
…dd docstrings

Replace the manual 'next(e for e in state.in_edges(node) if e.dst_conn==...)'
connector lookups in the work-depth library-node counters with the canonical
SDFGState.in_edges_by_connector / out_edges_by_connector API. Derive
total_volume.get_static_symbols' type-cast list from dace.dtypes (matched
longest-first) instead of hard-coding it. Add concise docstrings to the
matmul/reduce/dot/cholesky/inv/solve counters.
The control-flow-region and inlined forms produce identical work, differing
only in the auto-generated loop-execution-count symbol name
(num_execs_<cfg>_<node>), which is structure-dependent. Canonicalize that
symbol in the standardize comparison helper so the two forms compare equal,
drop the now-unnecessary skips on the *_inlined variants, and remove a stray
sdfg.save debug write.
Replace diagnostic print() calls in the work-depth and operational-intensity
analyses with warnings.warn (unrecognized tasklet function, loop fallback,
analysis-failed, only-map-scopes, no-merge-state, uncommon iedge assignment,
high MAPE), remove debug prints (the legacy-loop trace and the interactive
read/write-set dump), and drop the dead AccessStack.debug_print method. CLI
main() result output and the interactive ask_user prompts are kept.
Replace the traceback.print_exc debug dump in the operational-intensity
cache-miss helper with warnings.warn (and drop the now-unused traceback
import), narrow a bare except in helpers.get_static_symbols, move the cfg
import in operational_intensity to module scope, give the lazy matplotlib
import in op_in_helpers.plot a reason comment, and remove a dead commented
assignment in work_depth.
Two bugs in the ConditionalBlock handling of cfr_misses, which implements
the interactive ask_user branch picker:

- Picking the implicit-else option crashed: a stray reassignment replaced
  the real else-branch region object with the literal string 'else_branch',
  which has no start_block. Removed it; the surviving assignment already
  carries the real region and the prompt already renders "else_branch".

- A branch chosen once was not reused on later visits of the same
  conditional. The documented behavior (reuse the prior decision) was
  unimplemented, so a conditional inside a loop fell through to the
  worst-case path that explores all branches and takes the max every
  iteration. Added the decided-branch reuse case.

Adds two tests: branch selection honors the user's choice (different-work
branches yield different intensities), and a conditional inside a loop
prompts once and applies the chosen branch on every iteration.
The combined PR introduced two copies of get_static_symbols and
subs_till_fixed_point: an older one in helpers.py (used by work_depth and
operational_intensity) with a hard-coded type-name list and raw sympy, and an
improved one in total_volume.py deriving the type names from dace.dtypes and
using dace.symbolic. Keep a single improved implementation in helpers.py (the
shared module) and import it from total_volume.

get_static_symbols keeps returning string-keyed mappings, which both
substitution (expr.subs) and by-name indexing (mapping[loop_var] in
operational_intensity) rely on, so all three analyses now share the better
type-cast handling with no change to their contracts.
Cover the simulation-based path of analyze_sdfg_op_in: giving a symbol a range
'start,stop,step' samples it, simulates cache misses for each sample, and fits
the operational intensity as a function of that symbol. For the streaming
single_map64 kernel the fitted intensity is the constant 1/24, which the test
checks at several sizes.
When control flow is inlined to plain conditional interstate edges (no
ConditionalBlock), an undecidable data-dependent branch is handled by the
legacy traversal in cfg_misses. The non-ask_user path was broken: it referenced
an uninitialized curr_misses (crashing with UnboundLocalError) and computed a
final_misses/final_e that were discarded, so it never actually accounted the
chosen branch.

Rewrite it to mirror the ConditionalBlock handling: analyze every candidate
branch on copies of the mapping and stack, then continue along the worst-case
(most misses) one, committing its state before resuming at the merge state.

Adds a test that inlines a branchy program and checks the legacy traversal
yields the same operational intensity as its ConditionalBlock form.
Make the operational-intensity test docstrings concise and trim the worst-case
branch comment, with no change to behavior.
Replace the hand-rolled BFS in find_states_between with a light wrapper around
the existing dace.sdfg.utils.nodes_in_all_simple_paths utility, and drop the now
unused deque import. find_merge_state already used cfg.branch_merges; correct
its stale docstring to describe that instead of an unrelated function.
Remove the matmul/gemm/gemv/dot connector lookups whose results were never
used, leaving only the memlets each counter actually reads.
The operational intensity is collected into op_in_map by side effect, so the
returned miss count was unused. Call cfg_misses without binding it.
The analyses now require structured control flow (loops as LoopRegion, branches
as ConditionalBlock) and do not model early loop exits. This removes a large
amount of legacy machinery:

- helpers.py: drop the legacy loop-detection cluster (get_domtree, get_backedges,
  NodeCycle, LoopExtractionError, find_loop_guards_tails_exits, get_legacy_loop_body,
  get_legacy_loop_ranges).
- work_depth.py: drop the "legacy loops" section that summed and broke cycles.
- operational_intensity.py: drop the legacy interstate-branch handling
  (find_merge_state, find_states_between, mem_accesses_on_path and the
  candidate/worst-case block); the state walk now follows the single structured
  successor.

In their place, a shared helpers.has_unstructured_control_flow detects a legacy
loop (a cycle outside a LoopRegion), unstructured branching (a block with more
than one outgoing edge), or break/continue (BreakBlock / ContinueBlock). Each of
the three analyses checks it up front and, if unstructured, warns and returns a
zero result rather than a wrong one.

Tests: replace the inlined work-depth/op-intensity tests (which exercised the
removed legacy paths) with tests asserting the warning and zero result on
inlined SDFGs and on break/continue kernels; add the same for memory volume.
return is a non-local exit the analyses do not model, so treat ReturnBlock like
break/continue in has_unstructured_control_flow. With break/continue/return all
caught by the up-front bail, the (ReturnBlock, ContinueBlock, BreakBlock) ->
zero-cost branches in work-depth, operational-intensity and memory-volume are
unreachable; remove them and the now-unused imports.

Extend the work-depth bail test to cover an early-return kernel.
…tract

Fix the stale control_flow_region_work_depth docstring (it no longer breaks
loops; structured regions are already DAGs) and a ControlFlowRegion typo, and
note on each public entry point (work-depth, operational-intensity,
memory-volume) that only structured control flow is supported and unstructured
input yields a warned zero result.
…h import, no banners, test reuse)

- op_in_helpers: drop the unnecessary `from __future__ import annotations`; quote the one
  self-referential annotation instead.
- tests/polybench/{2mm,3mm,floyd-warshall,gemm,jacobi-2d}: replace the try/except import-polybench
  fallback with a plain `import polybench`.
- polybench_analysis_test: replace dashed banner comments with plain comments.
- total_volume_test: factor the duplicated LoopRegion construction into make_loop_volume_sdfg.
… int_floor

Replace the production sp.Symbol creation sites (loop var, num_execs, library-node work/depth/misses
placeholders) with dace.symbolic.symbol, and the parsed array-index symbols with pystr_to_symbolic
(which handles compound indices and emits dace symbols). Use dace.symbolic.int_floor for the symbolic
loop-range divisions in accumulate_over_range and scope_volume. An unbounded loop may have no loop
variable, so guard symbol creation (None then forces the execution-count fallback).
… test normalizers

Convert the symbol-creation sites in the assumption engine (assumptions.py) and the assumption-test
fixtures to dace.symbolic.symbol, so symbols are dace symbols everywhere. Fix the equal-symbol
self-reference check to compare by name (DaCe symbols are not interned, so the old `is` identity
check no longer held). With symbols now consistent, remove the test-side assumption-stripping
(standardize / reps) and rebuild the auxiliary expected symbols (num_execs, _p_i, Reduce_misses) as
the exact dace symbols the analyses emit. Also use `import dace` rather than `import dace as dc`.
Use dace.symbolic.pystr_to_symbolic for the expected-value conversions in the comparison helpers, and
compare the bail results against plain (0, 0).
…) instead of raw sympy

Replace sp.sympify with dace.symbolic.pystr_to_symbolic and sp.simplify with dace.symbolic.simplify
across the analyses and tests. The remaining sympy uses (Max/Min/log/Sum/Abs/Piecewise/oo/N and the
isinstance(x, sp.Symbol/sp.Basic) checks and sp.Expr type hints) have no dace.symbolic equivalent and
stay. Standardize the dace.symbolic imports to `from dace.symbolic import <names>` across all changed
files (renaming the colliding local `symbol` variable in assumptions.parse_assumptions to `lhs`).
operational_intensity: factor the identical in-/out-edge cache-miss accounting in scope_misses into a
single _edge_miss helper, and collapse update_map_iterators' two near-identical SymExpr/non-SymExpr
branches into one.

total_volume: factor scope_volume's symmetric read/write handling into _access_volume (global-memory
edge byte volume) and _accumulate_volume_over_var (sum a per-iteration volume over one loop/map var).
AccessStack.copy / AccessStack.in_cache_as_list / CacheLineTracker.copy had no callers left after the
unstructured-control-flow branch exploration was removed (the conditional handler deep-copies the
stack instead). Drop them and the now-unused deque import.
…e polybench

total_volume: resolve_minmax_over_range unpacked the canonicalized Max/Min
assuming two arguments, but canonicalization can collapse it (e.g. Max(j, j-1)
-> j); guard the unpack and map a collapsed node to its surviving operand. The
resulting exception was swallowed by cfr_volume's loop fallback while a
range-var-stack frame was already pushed, leaking the frame and multiplying a
sibling loop nest's volume by a spurious range -- this turned lu/ludcmp read
volume into a spurious O(N**5) (and write into O(N**3)). Pop the frame in
finally so a failure in the recursion cannot leak it. lu/ludcmp now report the
O(N**4) bounding-box read and O(N**2) write.

polybench_analysis_test: pin each kernel's work/read/write as a (symbolic,
value-at-_SIZES) pair side-by-side and assert the analysis matches the closed
form symbolically (not only at one size); re-pin lu/ludcmp to the corrected
volumes.
…l(x, -1)

total_volume: estimate a map's accessed region as the tighter (Min) of the
propagated boundary memlet (a bounding box, which over-counts the gap between
disjoint slices such as a row and a column of a triangular access) and the sum
of the per-connector footprints (each inner edge propagated over the map on its
own, which over-counts overlapping slices such as a stencil's neighbourhood).
Both are valid upper bounds on the working set, so their Min is a tighter valid
bound -- and exact for the polybench corpus. This pulls the triangular reads
(cholesky/lu/ludcmp) down from the bounding box's O(N**4) to the true O(N**3),
while stencils keep their tight bounding box.

resolve_minmax_over_range now divides out a positive common factor before the
affine test, so Min(2*j, j*(i - j + 1)) reduces to comparing 2 vs i - j + 1; the
accumulation iterates the resolution to a fixed point so nested Max/Min collapse
inside-out. A residual Max/Min that no loop range fixes (e.g. a stencil's box vs
its per-connector sum) is resolved by dominance at representative large sizes --
sound because either operand is a valid upper bound.

symbolic: int_floor/int_ceil(x, -1) now fold to -x (parallel to the existing
y == 1 rule and exact under the same integer assumption). A descending loop
(step -1) writes its trip count as (1 - N) // (-1); this folds it to the clean
N - 1 at construction, so the analyses report e.g. adi work 38*tsteps*(N-2)**2+40
and ludcmp work N*(N**2+2*N-2) instead of int_floor(1 - N, -1) forms.

Re-pin the affected polybench expected values (concrete values unchanged except
the cholesky/lu/ludcmp O(N**3) reads).
…name

_edge_access_volume gathered the per-connector footprints by matching the array
name, so for one access-node edge it summed every inner edge of that array --
including inner edges fed by a different connector (a second access node of the
same array), counting the footprint once per connector. Associate each edge with
the inner edges of its own connector via the IN_x <-> OUT_x pairing instead, and
fall back to the bounding box for any connector outside that convention. Behavior
is unchanged on the polybench corpus (each array enters its map through a single
connector); the brute-force working-set ground truth still matches.
PYFUNC_TO_ARITHMETICS recognized only a subset of the common math intrinsics
(exp, tanh, sin, cos, sqrt, atan2); tan, asin, acos, atan, sinh, cosh, log,
log2, log10, exp2 and cbrt were unrecognized, so a tasklet using them had its
transcendental work silently counted as zero (with an "unrecognized function"
warning). Add them at one realised operation each, matching the existing
op-count convention (np.* and math.* both lower to the bare C name). A test
asserts each is now counted rather than dropped.
resolve_size_dominated_minmax: replace the explicit values/index/clear loop with
min/max over the operands keyed on the probe value, collected into a set so a
single-element set means all sample sizes agreed. Fold the gcd-factor guard in
resolve_minmax_over_range into one try (drop the separate Integer(1) fallback).
Behavior-preserving: polybench + total_volume tests and the brute-force working-
set ground truth are unchanged.
The pinned EXPECTED values are self-referential (they pin what the analysis
emits). Add a test that independently re-enumerates the distinct global-memory
elements each kernel touches -- mirroring the kernel source under the cost model,
without using the analysis or the pinned values -- and asserts the analyzed
read/write bytes equal that true working-set traffic. Covers both regimes of the
bounding-box-vs-per-connector Min: triangular row+column / row+row accesses
(cholesky/lu/trisolv) and an overlapping stencil neighbourhood (jacobi-1d), at
two sizes each (including sizes outside the pinned set).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant