Spec 26: comparative benchmark engine — empirical 'which model wins for your work'

## Goal
**The killer feature.** Pick a representative subset of the user's past sessions, replay each one against N candidate models, score each replay deterministically, surface "for your kind of work, model X wins at $Y per outcome." Move the answer from vendor folklore to empirical fact.

## Why now
Every team is overpaying for the wrong model on the wrong task. Nobody can prove it. Owning the empirical answer is the most defensible product position the project can take. Mode recommender (Spec 18) is the heuristic v1 of this; this is the v2.

## Schema
**v022** — `benchmark_runs` + `benchmark_outcomes`:

```sql
CREATE TABLE benchmark_runs (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,                   -- user-chosen run label
  rubric_version INTEGER NOT NULL,
  candidate_models_json TEXT NOT NULL,  -- ["claude-opus-4-7", "gemini-3-pro", ...]
  source_session_ids_json TEXT NOT NULL,-- the sessions sampled from history
  status TEXT NOT NULL,                 -- 'queued' | 'running' | 'complete' | 'failed'
  created_ts TEXT NOT NULL,
  completed_ts TEXT,
  total_cost_usd REAL DEFAULT 0
);

CREATE TABLE benchmark_outcomes (
  benchmark_run_id INTEGER NOT NULL REFERENCES benchmark_runs(id),
  source_session_id TEXT NOT NULL,
  candidate_model TEXT NOT NULL,
  fork_session_id TEXT,                 -- references session_forks (Spec 25)
  status TEXT NOT NULL,                 -- 'pending' | 'running' | 'complete' | 'failed' | 'cancelled-over-budget'
  -- scoring (combination of Spec 21 static + Spec 23 LLM grader):
  static_score REAL,                    -- [0, 1]
  llm_score REAL,
  combined_score REAL,
  cost_usd REAL,
  duration_seconds INTEGER,
  num_turns INTEGER,
  evidence_json TEXT,
  PRIMARY KEY (benchmark_run_id, source_session_id, candidate_model)
);
```

## User-visible surface
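- **CLI**: `stackunderflow benchmark run --name "may-2026" --models claude-opus-4-7,claude-sonnet-4-5,gemini-3-pro --sample-size 10 --allow-cloud --budget-cap-usd 5`. Cloud candidates need `--allow-cloud` and a budget cap; without them the run is local-Ollama-only.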
- **CLI**: `stackunderflow benchmark show <name>` — leaderboard table.
- **CLI**: `stackunderflow benchmark recommend [--task-pattern "fix-bug" | "refactor" | "build-new"]` — uses past benchmark data to recommend a model for a new prompt.
- **API**: `POST /api/benchmark/runs`, `GET /api/benchmark/runs/{id}`, `GET /api/benchmark/recommend`.
- **Meta-agent tool**: `recommend_model_for_task(task_pattern, intent?)`.
- **UI**: new "Benchmark" tab — leaderboard, per-task heatmap, cost-vs-quality scatter plot.
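
The leaderboard behind `benchmark show` could be a single aggregate over `benchmark_outcomes`. The following is a minimal sketch, assuming that only completed replays count and that models are ranked by mean `combined_score` with total cost as a tie-breaker. The `leaderboard` helper and the trimmed table definition are illustrative, not part of the spec.

```python
import sqlite3

# Trimmed copy of benchmark_outcomes from the Schema section (illustration only).
SCHEMA = """
CREATE TABLE benchmark_outcomes (
  benchmark_run_id INTEGER NOT NULL,
  source_session_id TEXT NOT NULL,
  candidate_model TEXT NOT NULL,
  status TEXT NOT NULL,
  combined_score REAL,
  cost_usd REAL,
  PRIMARY KEY (benchmark_run_id, source_session_id, candidate_model)
);
"""

# Assumption: failed or unfinished replays are excluded from the ranking.
LEADERBOARD_SQL = """
SELECT candidate_model,
       COUNT(*)            AS replays,
       AVG(combined_score) AS mean_score,
       SUM(cost_usd)       AS total_cost_usd
FROM benchmark_outcomes
WHERE benchmark_run_id = ? AND status = 'complete'
GROUP BY candidate_model
ORDER BY mean_score DESC, total_cost_usd ASC
"""


def leaderboard(conn, run_id):
    """Return one dict per candidate model, best mean score first."""
    rows = conn.execute(LEADERBOARD_SQL, (run_id,)).fetchall()
    return [
        {"model": m, "replays": n, "mean_score": s, "total_cost_usd": c}
        for m, n, s, c in rows
    ]
```

Filtering on `status = 'complete'` keeps a model from being rewarded or punished for replays that never produced a score, at the price of hiding its failure rate; a real leaderboard would probably show that count alongside.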

## Implementation plan
1. v022 migration.
2. New service `stackunderflow/services/benchmark.py`:
   - `sample_sessions(conn, count, *, intent_filter, ...) -> [session_id]` — stratified sample (mix of intents, models, outcomes).
   - `score(static_findings, llm_grade) -> {static, llm, combined}` — weighted average; weights configurable.
   - `recommend(conn, task_features) -> {model, confidence, evidence_runs}` — query benchmark_outcomes filtered by similar task_features.
3. Background runner using Spec 25's fork machinery: for each (sample_session, candidate_model), kick a fork; on completion, compute score (uses Spec 21 + Spec 23).
4. CLI + API + meta-agent + UI tab.
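
One way to read the "stratified sample" in step 2 is a round-robin draw across strata, so that small strata are represented before large ones are sampled twice. This is a sketch under assumptions: sessions are plain dicts with an `id` field, the stratum comes from a caller-supplied `key` function, and `stratified_sample` is a hypothetical stand-in for the spec's `sample_sessions`, which takes a database connection instead.

```python
import random
from collections import defaultdict


def stratified_sample(sessions, count, *, key, seed=0):
    """Pick up to `count` session ids, cycling across strata.

    sessions: iterable of dicts with an "id" field (hypothetical shape).
    key: function mapping a session to its stratum, e.g. its intent.
    """
    rng = random.Random(seed)  # seeded so a benchmark run is reproducible
    strata = defaultdict(list)
    for session in sessions:
        strata[key(session)].append(session["id"])

    # Shuffle inside each stratum; visit strata in a stable order.
    buckets = []
    for stratum in sorted(strata, key=repr):
        ids = strata[stratum]
        rng.shuffle(ids)
        buckets.append(ids)

    picked = []
    # Round-robin: take one id from each non-empty stratum per pass.
    while len(picked) < count and any(buckets):
        for ids in buckets:
            if ids and len(picked) < count:
                picked.append(ids.pop())
    return picked
```

With `key=lambda s: (s["intent"], s["model"])` the same function stratifies on intent and model together. If `count` is smaller than the number of strata, some strata are necessarily left out.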

## Tests
- Sampling stratification: synthetic store, assert sample covers all intents.
- Scoring: known static + LLM scores → assert combined.
- Recommendation: synthetic benchmark_outcomes → assert correct model picked for matching task.
- Budget cap respected: forks past the cap don't run; status='cancelled-over-budget'.
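
The budget-cap test implies some step that decides which forks may run. A minimal sketch, assuming a pre-flight cost estimate per (session, model) pair is available: `plan_forks` and `estimate_cost_usd` are hypothetical names, and a real runner would presumably also track actual spend as forks complete.

```python
def plan_forks(pairs, estimate_cost_usd, budget_cap_usd):
    """Assign each (session_id, model) pair a status under a spend cap.

    Pairs are considered in order. A pair whose estimated cost would push
    the running total past the cap is marked 'cancelled-over-budget' and
    does not count toward the total.
    """
    planned = []
    spent = 0.0
    for session_id, model in pairs:
        cost = estimate_cost_usd(session_id, model)
        if spent + cost <= budget_cap_usd:
            spent += cost
            status = "pending"
        else:
            status = "cancelled-over-budget"
        planned.append((session_id, model, status))
    return planned, spent
```

For example, three forks estimated at $2 each under a $5 cap yield two `pending` rows and one `cancelled-over-budget` row, with $4 committed.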

## Hard parts
- **Scoring rubric design.** This is product judgment, not engineering. **Maintainer should write the v1 rubric before any agent implements.** Suggested:
  - Static (40%): test_pass_delta, complexity_delta, lint_delta, type_completeness_delta
  - LLM (40%): problem_solved + code_clean + edge_cases (Spec 23)
  - Cost (20%): inverse-normalized cost vs the cheapest candidate
- **Cost.** Replaying 10 sessions × 4 models × ~50 turns each = ~$5-50 of cloud-LLM cost. Default to local-Ollama-only; require explicit `--allow-cloud` for cloud forks.
- **Sampling bias.** If the sample is skewed (all-Opus sessions), the result is biased. Stratify carefully.
- **"Replay quality is not real quality."** A model "winning" a replay doesn't mean it would have won the original session — it had less context, different tools, different state. Document this caveat loudly.
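
To make the suggested 40/40/20 split concrete, here is one possible reading as code. It is only a sketch: the rubric is the maintainer's to define, "inverse-normalized" is interpreted here as `cheapest_cost / cost`, and both the static and LLM scores are assumed to be in [0, 1] already. The function names are illustrative.

```python
# Weights from the suggested rubric; meant to be configurable.
WEIGHTS = {"static": 0.4, "llm": 0.4, "cost": 0.2}


def cost_score(cost_usd, cheapest_usd):
    """Map cost to [0, 1]: the cheapest candidate scores 1.0.

    Assumption: a zero-cost replay (e.g. a local model) also scores 1.0.
    """
    if cost_usd <= 0:
        return 1.0
    return min(1.0, cheapest_usd / cost_usd)


def combined_score(static, llm, cost_usd, cheapest_usd, weights=WEIGHTS):
    """Weighted average of the static, LLM, and cost components."""
    parts = {
        "static": static,
        "llm": llm,
        "cost": cost_score(cost_usd, cheapest_usd),
    }
    total_weight = sum(weights.values())
    return sum(weights[k] * parts[k] for k in weights) / total_weight
```

A replay with a static score of 1.0 and an LLM score of 0.5, costing twice as much as the cheapest candidate, lands at 0.4·1.0 + 0.4·0.5 + 0.2·0.5 = 0.7. Note that when the cheapest candidate is free, every paid candidate gets a cost component of 0 under this reading, which may be harsher than intended.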

## Out of scope
- Cross-user benchmark aggregation (defer to Spec 28 multi-device + an opt-in "share" surface).
- Real-time benchmark on every session (this is sample-based offline).
- Closed-source-model parity (we replay against whatever the user has API access to).

## Dependencies
- **Blocked by**: Spec 21 (static analysis), Spec 23 (LLM grader), Spec 25 (fork mode).
- This is the project's flagship feature. **Maintainer should write the rubric v1 before dispatch.**

## Estimated effort
**Size XL** — single agent, ~4-6 hr. Combined: scoring, sampling, recommender, UI tab. Could be split into 3 issues (sampler+runner / scoring+recommender / UI).

## Hard rules
- DO NOT touch versions / CHANGELOG headings.
- Pre-assigned schema slot: **v022**.
- Branch: `feat/comparative-benchmark` off main.
- Default to LOCAL Ollama only; cloud forks require explicit opt-in + budget cap.
- Maintainer writes the scoring rubric BEFORE this dispatches. Issue stays in `needs-design` label until rubric is in `docs/specs/benchmark-rubric-v1.md`.
