Goal
The killer feature. Pick a representative subset of the user's past sessions, replay each one against N candidate models, score each replay deterministically, surface "for your kind of work, model X wins at $Y per outcome." Move the answer from vendor folklore to empirical fact.
Why now
Every team is overpaying for the wrong model on the wrong task. Nobody can prove it. Owning the empirical answer is the most defensible product position the project can take. Mode recommender (Spec 18) is the heuristic v1 of this; this is the v2.
Schema
v022 — benchmark_runs + benchmark_outcomes:
CREATE TABLE benchmark_runs (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL, -- user-chosen run label
rubric_version INTEGER NOT NULL,
candidate_models_json TEXT NOT NULL, -- ['claude-opus-4-7', 'gemini-3-pro', ...]
source_session_ids_json TEXT NOT NULL,-- the sessions sampled from history
status TEXT NOT NULL, -- 'queued' | 'running' | 'complete' | 'failed'
created_ts TEXT NOT NULL,
completed_ts TEXT,
total_cost_usd REAL DEFAULT 0
);
CREATE TABLE benchmark_outcomes (
benchmark_run_id INTEGER NOT NULL REFERENCES benchmark_runs(id),
source_session_id TEXT NOT NULL,
candidate_model TEXT NOT NULL,
fork_session_id TEXT, -- references session_forks (Spec 25)
status TEXT NOT NULL, -- 'pending' | 'running' | 'complete' | 'failed'
-- scoring (combination of Spec 21 static + Spec 23 LLM grader):
static_score REAL, -- [0, 1]
llm_score REAL,
combined_score REAL,
cost_usd REAL,
duration_seconds INTEGER,
num_turns INTEGER,
evidence_json TEXT,
PRIMARY KEY (benchmark_run_id, source_session_id, candidate_model)
);
User-visible surface
- CLI:
stackunderflow benchmark run --name "may-2026" --models claude-opus-4-7,claude-sonnet-4-5,gemini-3-pro --sample-size 10 [--budget-cap-usd 5].
- CLI:
stackunderflow benchmark show <name> — leaderboard table.
- CLI:
stackunderflow benchmark recommend [--task-pattern "fix-bug" | "refactor" | "build-new"] — uses past benchmark data to recommend a model for a new prompt.
- API:
POST /api/benchmark/runs, GET /api/benchmark/runs/{id}, GET /api/benchmark/recommend.
- Meta-agent tool:
recommend_model_for_task(task_pattern, intent?).
- UI: new "Benchmark" tab — leaderboard, per-task heatmap, cost-vs-quality scatter plot.
Implementation plan
- v022 migration.
- New service
stackunderflow/services/benchmark.py:
sample_sessions(conn, count, *, intent_filter, ...) -> [session_id] — stratified sample (mix of intents, models, outcomes).
score(static_findings, llm_grade) -> {static, llm, combined} — weighted average; weights configurable.
recommend(conn, task_features) -> {model, confidence, evidence_runs} — query benchmark_outcomes filtered by similar task_features.
- Background runner using Spec 25's fork machinery: for each (sample_session, candidate_model), kick a fork; on completion, compute score (uses Spec 21 + Spec 23).
- CLI + API + meta-agent + UI tab.
Tests
- Sampling stratification: synthetic store, assert sample covers all intents.
- Scoring: known static + LLM scores → assert combined.
- Recommendation: synthetic benchmark_outcomes → assert correct model picked for matching task.
- Budget cap respected: forks past the cap don't run; status='cancelled-over-budget'.
Hard parts
- Scoring rubric design. This is product judgment, not engineering. Maintainer should write the v1 rubric before any agent implements. Suggested:
- Static (40%): test_pass_delta, complexity_delta, lint_delta, type_completeness_delta
- LLM (40%): problem_solved + code_clean + edge_cases (Spec 23)
- Cost (20%): inverse-normalized cost vs the cheapest candidate
- Cost. Replaying 10 sessions × 4 models × ~50 turns each = ~$5-50 of cloud-LLM cost. Default to local-Ollama-only; require explicit
--allow-cloud for cloud forks.
- Sampling bias. If the sample is skewed (all-Opus sessions), the result is biased. Stratify carefully.
- "Replay quality is not real quality." A model "winning" a replay doesn't mean it would have won the original session — it had less context, different tools, different state. Document this caveat loudly.
Out of scope
- Cross-user benchmark aggregation (defer to Spec 28 multi-device + an opt-in "share" surface).
- Real-time benchmark on every session (this is sample-based offline).
- Closed-source-model parity (we replay against whatever the user has API access to).
Dependencies
- Blocked by: Spec 21 (static analysis), Spec 23 (LLM grader), Spec 25 (fork mode).
- This is the project's flagship feature. Maintainer should write the rubric v1 before dispatch.
Estimated effort
Size XL — single agent, ~4-6 hr. Combined: scoring, sampling, recommender, UI tab. Could be split into 3 issues (sampler+runner / scoring+recommender / UI).
Hard rules
- DO NOT touch versions / CHANGELOG headings.
- Pre-assigned schema slot: v022.
- Branch:
feat/comparative-benchmark off main.
- Default to LOCAL Ollama only; cloud forks require explicit opt-in + budget cap.
- Maintainer writes the scoring rubric BEFORE this dispatches. Issue stays in
needs-design label until rubric is in docs/specs/benchmark-rubric-v1.md.
Goal
The killer feature. Pick a representative subset of the user's past sessions, replay each one against N candidate models, score each replay deterministically, surface "for your kind of work, model X wins at $Y per outcome." Move the answer from vendor folklore to empirical fact.
Why now
Every team is overpaying for the wrong model on the wrong task. Nobody can prove it. Owning the empirical answer is the most defensible product position the project can take. Mode recommender (Spec 18) is the heuristic v1 of this; this is the v2.
Schema
v022 —
benchmark_runs+benchmark_outcomes:User-visible surface
stackunderflow benchmark run --name "may-2026" --models claude-opus-4-7,claude-sonnet-4-5,gemini-3-pro --sample-size 10 [--budget-cap-usd 5].stackunderflow benchmark show <name>— leaderboard table.stackunderflow benchmark recommend [--task-pattern "fix-bug" | "refactor" | "build-new"]— uses past benchmark data to recommend a model for a new prompt.POST /api/benchmark/runs,GET /api/benchmark/runs/{id},GET /api/benchmark/recommend.recommend_model_for_task(task_pattern, intent?).Implementation plan
stackunderflow/services/benchmark.py:sample_sessions(conn, count, *, intent_filter, ...) -> [session_id]— stratified sample (mix of intents, models, outcomes).score(static_findings, llm_grade) -> {static, llm, combined}— weighted average; weights configurable.recommend(conn, task_features) -> {model, confidence, evidence_runs}— query benchmark_outcomes filtered by similar task_features.Tests
Hard parts
--allow-cloudfor cloud forks.Out of scope
Dependencies
Estimated effort
Size XL — single agent, ~4-6 hr. Combined: scoring, sampling, recommender, UI tab. Could be split into 3 issues (sampler+runner / scoring+recommender / UI).
Hard rules
feat/comparative-benchmarkoff main.needs-designlabel until rubric is indocs/specs/benchmark-rubric-v1.md.