Per the 0.4.0 version description: "Unified benchmark framework at python -m bench (subcommands run, compare, plot, summary, list) with generator-driven probes in bench/_operations.py, peak-memory recording in bench/harness.py:measure_peak_memory, fixed seeds (0, 1, 2, 3), and a self-contained interactive Plotly dashboard at bench/_dashboard.py."
Files to touch
bench/__main__.py
bench/_operations.py
bench/harness.py
bench/_probes.py
bench/_seeds.py
bench/_run.py
bench/_io.py
bench/_dashboard.py
Example to follow
tests/generators/ # generator-driven, per-seed/per-backend problem construction
tests/backend/_conformance.py # per-op harness + backend matrix layout
Task
Add a top-level bench package runnable as python -m bench exposing the
subcommands run, compare, plot, summary, and list.
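As a rough illustration of the entry point, the five subcommands could be wired with the standard library's argparse. This is only a sketch: the names build_parser and main are hypothetical, and attaching --family/--match to run and list specifically is an assumption, since the task only says that bench/__main__.py filters probes by those flags.

```python
import argparse

# The five subcommands named in the task.
SUBCOMMANDS = ("run", "compare", "plot", "summary", "list")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one sub-parser per subcommand (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="python -m bench")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in SUBCOMMANDS:
        cmd = sub.add_parser(name)
        # Assumption: probe filters are accepted by `run` and `list` only.
        if name in ("run", "list"):
            cmd.add_argument("--family", default=None)
            cmd.add_argument("--match", default=None)
    return parser


def main(argv=None) -> int:
    """Parse arguments and report which subcommand would be dispatched."""
    args = build_parser().parse_args(argv)
    print(f"would dispatch to: {args.command}")
    return 0
```

In a real bench/__main__.py the module would end by calling main() under the usual `if __name__ == "__main__":` guard; here main(["list"]) can be called directly to try it out.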
- Define the probe abstraction in bench/_probes.py (Probe, ProbeCase, a
registry) so each probe builds a fresh problem on a per-seed, per-backend
basis and declares its supported backends.
- Populate bench/_operations.py with generator-driven probes spanning the same
surface as the test suite (space ops, LinOp apply/rapply/vapply, dense /
diagonal / sparse / identity / composed / summed). Importing the module
registers the probes.
- Pin the seed quartet (0, 1, 2, 3) in bench/_seeds.py; every run iterates
exactly these seeds.
- Implement measure_peak_memory(fn) in bench/harness.py returning peak
allocation for a zero-arg callable.
- Wire bench/__main__.py to filter probes by --family/--match, run them
over the requested backends/devices, and emit a JSON artifact (bench/_io.py)
plus the verdict text; plot renders the self-contained interactive Plotly
dashboard in bench/_dashboard.py.
Done condition
The bench smoke suite passes, confirming the registry is populated, names are
unique, the seed quartet is the documented one, and each probe factory builds a
runnable NumPy case.
pytest tests/bench/test_bench_smoke.py -q
Mathematical note
Benchmarks measure performance, not correctness, so there is no numerical
invariant to hold. The only reproducibility invariant is determinism: a probe at
a fixed (seed, backend, size) must construct an identical problem instance on
every run — the seed quartet (0, 1, 2, 3) is the sole entropy source, and no
probe may read wall-clock time or unseeded RNG into the problem it builds.
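The determinism requirement can be checked mechanically. The sketch below is an illustration only: build_problem and is_deterministic are hypothetical names, the backend is left out for brevity, and stdlib random stands in for whatever backend-specific generator a real probe would use. The point is that the generator is created locally from the seed, never taken from global RNG state or the clock.

```python
from __future__ import annotations

import random

SEEDS = (0, 1, 2, 3)  # the sole entropy source


def build_problem(seed: int, size: int) -> list[float]:
    """Construct a problem instance from a local, explicitly seeded generator."""
    rng = random.Random(seed)  # no global RNG state, no wall-clock input
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def is_deterministic(size: int = 64) -> bool:
    """Check that building twice at the same (seed, size) gives identical data."""
    return all(build_problem(s, size) == build_problem(s, size) for s in SEEDS)
```

Comparing two independent builds per seed, rather than comparing against stored values, keeps the check valid across library versions.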
Prerequisite level