An agent-driven loop that optimizes a model's inference latency on a specific GPU
without degrading output quality, and proves it. You point Claude Code at it, set a
/goal, and walk away.
Worked example: the Wan 2.2 VAE decoder on an H100, fp32 14.47s to 4.79s (3.02x), found blind in 18 experiments with quality held inside a frozen gate.
Writeup: https://www.heygen.com/research/auto-optim-ai-model-inference-speedup
- The agent edits exactly one file,
optimize.py(build_decoder()). - A frozen evaluator it cannot touch defines success:
harness/verify.py(output must match a frozen fp32 reference) plusharness/bench.py(latency). harness/run_experiment.pyverifies, benchmarks, then commits improvements and reverts regressions. The git history is the research record, and every kept result auto-runs a profiler so the next idea comes from data, not a guess.- Durable memory:
results.tsv,leaderboard.md,PROGRESS.md(journal + dead-ends).
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth --local-dir ./assets
export CUDA_VISIBLE_DEVICES=<free_gpu>
python harness/make_latent.py # downloads a CC-BY clip, encodes the fixed input latent
python harness/make_reference.py # fp32 decode of it = the frozen quality target
python harness/run_experiment.py --desc "baseline fp32 eager"This repo is wired for Claude Code. CLAUDE.md (a symlink to AGENTS.md) is the
always-loaded contract; program.md is the full runbook, loaded on demand; a scoped
model-optimizer subagent does the edits; a Stop hook refuses to end a turn until the
experiment is logged and the gate is green.
Open the repo in Claude Code and hand it the wheel with a /goal:
/goal Drive Wan 2.2 VAE decode latency as low as it goes while harness/verify.py keeps
passing. Follow program.md: edit only optimize.py, run harness/run_experiment.py
after every change, and make each experiment attack what the latest profile shows.
Stop only when the best leaderboard.md time has not improved for 8 consecutive
profile-guided attempts AND the profile shows no bottleneck left worth attacking.
/goal re-checks that completion condition every turn, so the loop runs unattended
until it hits a real plateau rather than an arbitrary experiment count. Watch it climb
in leaderboard.md and PROGRESS.md. If it starts nibbling tiny gains, do not hand it
the answer: sharpen the rule in program.md (we added an anti-nibbling rule mid-run and
the plateau broke). Program the org, not the Python.
The fastest way is to let the agent do the wiring: open the repo in Claude Code and
ask it to retarget to your model. Nothing here is hardcoded to a VAE; the harness is
generic and the model-optimizer subagent is model-agnostic (the target lives in
goals.json). It will drop your model and a fixed input into model/ and assets/,
return your forward pass from optimize.py, and write goals.json for you. Then you
set the /goal and it runs the same loop.
By hand it is three swaps; everything else stays as is:
model/+assets/: your model, checkpoint, and one fixed input.optimize.py::build_decoder: return your(forward_fn, info).goals.json(metric, thresholds, stop condition), plus the comparison inharness/_common.pyif your quality metric differs.
Pin a random seed first, or run-to-run noise will read as a speedup.
- The shipped
optimize.pyis the winning config, tuned for a single H100. On other hardware treat it as a starting point and rerun the loop. - Benchmark input is Tears of Steel (c) Blender Foundation, mango.blender.org, CC-BY 3.0. No proprietary data is used.
- Heavy assets (
*.pth,*.pt,*.ckpt, downloaded video) are gitignored and generated locally by the setup steps above.
optimize.py the only file the agent edits goals.json frozen target + gate
AGENTS.md agent contract program.md loop runbook
harness/ frozen evaluator + driver + profiler .claude/agents/model-optimizer.md
model/ frozen model definition assets/ weights, latent, reference
results.tsv leaderboard.md PROGRESS.md the durable record