Goal
Run an offline LLM (local Ollama) over each session's transcript + tool calls and emit a quality grade against a multi-dimensional rubric: problem-solved, tests-added, code-clean, edge-cases-handled. Subjective per call, consistent in aggregate. Used for ranking, not absolute scores.
Why now
Static analysis (Spec 21) gives objective deltas; outcome attribution (Spec 22) gives "did the code ship". Neither answers "was the agent's reasoning sound?" or "did the agent skip an edge case?". An LLM can grade that — coarsely, but consistently.
Schema
v020 — session_quality_grades:
CREATE TABLE session_quality_grades (
    id             INTEGER PRIMARY KEY,
    session_id     TEXT NOT NULL,
    grader_model   TEXT NOT NULL,     -- which model graded (e.g. 'qwen2.5-coder:7b')
    rubric_version INTEGER NOT NULL,  -- bump when the rubric changes
    ts             TEXT NOT NULL,
    problem_solved REAL,              -- [0, 1] or NULL if "can't tell"
    tests_added    REAL,
    code_clean     REAL,
    edge_cases     REAL,
    overall        REAL,              -- weighted average; weights in details
    reasoning      TEXT,              -- ~200-char summary
    details_json   TEXT,              -- per-rubric notes
    UNIQUE (session_id, grader_model, rubric_version)
);
CREATE INDEX idx_sqg_session ON session_quality_grades(session_id);
CREATE INDEX idx_sqg_overall ON session_quality_grades(overall DESC);
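The UNIQUE constraint is what makes re-grading safe to repeat. A minimal sketch of how a write and a "latest rubric" read could look with Python's sqlite3 module, assuming SQLite 3.24+ for ON CONFLICT support; save_grade and latest_grade are hypothetical helper names, and only a subset of the score columns is written to keep the example short:

```python
import sqlite3
from datetime import datetime, timezone

# Same table as the v020 schema, compacted for the example.
DDL = """
CREATE TABLE session_quality_grades (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    grader_model TEXT NOT NULL,
    rubric_version INTEGER NOT NULL,
    ts TEXT NOT NULL,
    problem_solved REAL, tests_added REAL, code_clean REAL, edge_cases REAL,
    overall REAL, reasoning TEXT, details_json TEXT,
    UNIQUE (session_id, grader_model, rubric_version)
);
"""

# Re-grading the same (session, model, rubric) overwrites the existing row.
UPSERT = """
INSERT INTO session_quality_grades
    (session_id, grader_model, rubric_version, ts, overall, reasoning)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (session_id, grader_model, rubric_version) DO UPDATE SET
    ts = excluded.ts, overall = excluded.overall, reasoning = excluded.reasoning
"""


def save_grade(conn, session_id, model, rubric_version, overall, reasoning):
    """Insert a grade, or update it if this triple was graded before."""
    ts = datetime.now(timezone.utc).isoformat()
    conn.execute(UPSERT, (session_id, model, rubric_version, ts, overall, reasoning))
    conn.commit()


def latest_grade(conn, session_id, model):
    """Return (rubric_version, overall) for the newest rubric, or None."""
    return conn.execute(
        "SELECT rubric_version, overall FROM session_quality_grades "
        "WHERE session_id = ? AND grader_model = ? "
        "ORDER BY rubric_version DESC LIMIT 1",
        (session_id, model),
    ).fetchone()
```

Grades written under older rubric versions stay in the table; the read path simply prefers the highest rubric_version.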
User-visible surface
- CLI: stackunderflow grade session <id> [--model qwen2.5-coder:7b] runs the grader on one session.
- CLI: stackunderflow grade backfill [--since 30d] [--limit N] [--model M] batches over recent ungraded sessions.
- API: GET /api/grades/session/{id} and GET /api/grades/recent?min_overall=0.7.
- Meta-agent tool: get_session_grade(session_id).
- UI: a "Quality" sortable column on Sessions tab + a per-session breakdown panel.
Implementation plan
- v020 migration.
- New service stackunderflow/services/quality_grader.py:
  - RUBRIC_V1: the prompt that asks the grader to score each dimension.
  - _serialize_session_for_grading(session_id, max_tokens=8000): produces a compact transcript (user prompts, assistant summaries, key tool calls, final outcome) that stays under the model's context budget.
  - grade(conn, session_id, grader_model): calls Ollama, parses the JSON-shaped response, persists the grade.
  - backfill(conn, since, limit, model): batches with a concurrency cap (default 2).
- Reuse the meta-agent's Ollama proxy / NDJSON streaming where appropriate.
- CLI + API + meta-agent wiring.
- Frontend column.
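The Ollama call and the capped backfill loop could be sketched as follows. This is an illustration under assumptions, not the planned implementation: it uses Ollama's non-streaming /api/generate endpoint on the default local port rather than the meta-agent's proxy, run_backfill takes a plain list of session ids instead of the (conn, since, limit, model) signature above, and grade_one is a hypothetical callable that serializes, grades, and returns one session's scores.

```python
import json
import logging
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

log = logging.getLogger("quality_grader")

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def call_ollama(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one non-streaming generate request and return the raw model text."""
    payload = {"model": model, "prompt": prompt, "format": "json", "stream": False}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def run_backfill(session_ids, grade_one, max_workers: int = 2) -> dict:
    """Grade sessions with at most `max_workers` requests in flight.

    `grade_one(session_id)` is a hypothetical callable returning a grade dict.
    A failure on one session is logged and skipped so the batch continues.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(grade_one, sid): sid for sid in session_ids}
        for fut in as_completed(futures):
            sid = futures[fut]
            try:
                results[sid] = fut.result()
            except Exception:
                log.exception("grading failed for session %s", sid)
    return results
```

Returning results to the caller keeps all database writes on one thread, which sidesteps sqlite3's default restriction on sharing a connection across threads.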
Tests
- Mock Ollama with a deterministic response → assert parse + persist.
- Malformed grader response (non-JSON) → graceful failure (don't persist; log; continue).
- Idempotency: re-grading with the same (session, model, rubric_version) upserts.
- Rubric-version bump: grades from older rubric versions are treated as stale on read (return the grade with the latest rubric_version).
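The parse-related cases could be written along these lines. parse_grade and the flat JSON shape it expects (one key per rubric dimension, each a number in [0, 1] or null) are assumptions made for illustration; the real response schema is whatever RUBRIC_V1 ends up requesting.

```python
import json

DIMENSIONS = ("problem_solved", "tests_added", "code_clean", "edge_cases")


def parse_grade(raw):
    """Return a dict of validated scores, or None if the response is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    scores = {}
    for dim in DIMENSIONS:
        if dim not in data:
            return None  # strict: every dimension must be present
        value = data[dim]
        if value is None:
            scores[dim] = None  # the grader answered "can't tell"
        elif (isinstance(value, (int, float)) and not isinstance(value, bool)
              and 0.0 <= value <= 1.0):
            scores[dim] = float(value)
        else:
            return None  # wrong type or outside [0, 1]
    return scores


def test_valid_response_parses():
    raw = ('{"problem_solved": 0.9, "tests_added": null, '
           '"code_clean": 0.5, "edge_cases": 1}')
    assert parse_grade(raw) == {
        "problem_solved": 0.9, "tests_added": None,
        "code_clean": 0.5, "edge_cases": 1.0,
    }


def test_non_json_is_rejected():
    assert parse_grade("Sure! Here is my assessment of the session...") is None


def test_out_of_range_score_is_rejected():
    raw = ('{"problem_solved": 1.7, "tests_added": 0, '
           '"code_clean": 0, "edge_cases": 0}')
    assert parse_grade(raw) is None
```

A None result gives the caller a single signal to log the failure and move on without persisting anything.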
Hard parts
- The grader output must be machine-parseable. Use structured JSON output (Ollama supports format: "json"). Have a strict schema; reject malformed responses.
- Context-window management: a 6000-message session can't fit. Truncate intelligently: keep first 5 + last 5 user prompts, sample the middle, include key signals (errors, reverts, "thanks").
- The grader is a model — it has its own biases. Document this explicitly. Recommend running the grader with multiple models and averaging if precision matters.
- Cost: grading on local Ollama incurs no API charges (roughly $0.001-0.01 per session, effectively free), versus ~$0.05-0.20 per session via a cloud API. Local-only by default.
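The truncation rule for oversized sessions can be sketched as below. select_prompts, its parameter names, the even-stride sampling of the middle, and the substring match on signal words are all assumptions chosen for illustration; the spec only fixes the "first 5 + last 5, sample the middle, keep key signals" policy.

```python
def select_prompts(prompts, keep_head=5, keep_tail=5, middle_samples=10,
                   signals=("error", "revert", "thanks")):
    """Pick a subset of user prompts for grading, preserving original order."""
    n = len(prompts)
    if n <= keep_head + keep_tail:
        return list(prompts)  # small session: nothing to drop

    # Always keep the opening and closing prompts.
    keep = set(range(keep_head)) | set(range(n - keep_tail, n))
    middle = range(keep_head, n - keep_tail)

    # Evenly spaced sample of the middle section.
    if middle_samples > 0:
        step = max(1, len(middle) // middle_samples)
        keep.update(middle[::step][:middle_samples])

    # Keep any middle prompt that mentions a signal word (case-insensitive).
    for i in middle:
        text = prompts[i].lower()
        if any(word in text for word in signals):
            keep.add(i)

    return [prompts[i] for i in sorted(keep)]
```

This bounds the message count but not the token count, so the serializer would still need to check the result against max_tokens and shrink further if a few prompts are very long.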
Out of scope
- Cloud-API grading by default (local Ollama only; cloud would be a future opt-in).
- Real-time grading (this is offline backfill).
- Cross-session "agent-skill ranking" (Spec 26's job).
Dependencies
- Spec 21 (static analysis) is helpful context but not blocking.
- Consumed by Spec 26 (comparative benchmark) — combines static + LLM grades into the scoring rubric.
Estimated effort
Size M — single agent, ~1-1.5 hr.
Hard rules
- DO NOT touch versions / CHANGELOG headings.
- Pre-assigned schema slot: v020.
- Branch: feat/llm-graded-quality off main.