Spec 23: LLM-graded session quality — offline rubric scoring of every session

## Goal
Run an offline LLM (local Ollama) over each session's transcript + tool calls and emit a quality grade against a multi-dimensional rubric: problem-solved, tests-added, code-clean, edge-cases-handled. Subjective per call, consistent in aggregate. Used for ranking, not absolute scores.

## Why now
Static analysis (Spec 21) gives objective deltas; outcome attribution (Spec 22) gives "did the code ship". Neither answers "was the agent's reasoning sound?" or "did the agent skip an edge case?". An LLM can grade that — coarsely, but consistently.

## Schema
**v020** — `session_quality_grades`:

```sql
CREATE TABLE session_quality_grades (
  id INTEGER PRIMARY KEY,
  session_id TEXT NOT NULL,
  grader_model TEXT NOT NULL,     -- which model graded (e.g. 'qwen2.5-coder:7b')
  rubric_version INTEGER NOT NULL,-- bump when the rubric changes
  ts TEXT NOT NULL,
  problem_solved REAL,            -- [0, 1] or NULL if "can't tell"
  tests_added REAL,
  code_clean REAL,
  edge_cases REAL,
  overall REAL,                   -- weighted average; weights in details
  reasoning TEXT,                 -- ~200-char summary
  details_json TEXT,              -- per-rubric notes
  UNIQUE (session_id, grader_model, rubric_version)
);
CREATE INDEX idx_sqg_session ON session_quality_grades(session_id);
CREATE INDEX idx_sqg_overall ON session_quality_grades(overall DESC);
```

## User-visible surface
- **CLI**: `stackunderflow grade session <id> [--model qwen2.5-coder:7b]` runs the grader on one session.
- **CLI**: `stackunderflow grade backfill [--since 30d] [--limit N] [--model M]` batches over recent ungraded sessions.
- **API**: `GET /api/grades/session/{id}` and `GET /api/grades/recent?min_overall=0.7`.
- **Meta-agent tool**: `get_session_grade(session_id)`.
- **UI**: a "Quality" sortable column on Sessions tab + a per-session breakdown panel.

## Implementation plan
1. v020 migration.
2. New service `stackunderflow/services/quality_grader.py`:
   - `RUBRIC_V1` — the prompt that asks the grader to score each dimension.
   - `_serialize_session_for_grading(session_id, max_tokens=8000)` — produce a compact transcript: user prompts, assistant summaries, key tool calls, final outcome. Stays under the model's context budget.
   - `grade(conn, session_id, grader_model)` — call Ollama, parse JSON-shaped response, persist.
   - `backfill(conn, since, limit, model)` — batch with concurrency cap (default 2).
3. Reuse the meta-agent's Ollama proxy / NDJSON streaming where appropriate.
4. CLI + API + meta-agent wiring.
5. Frontend column.

## Tests
- Mock Ollama with a deterministic response → assert parse + persist.
- Malformed grader response (non-JSON) → graceful failure (don't persist; log; continue).
- Idempotency: re-grading with the same `(session, model, rubric_version)` upserts.
- Rubric-version bump invalidates old grades on read (return latest).

## Hard parts
- The grader output must be machine-parseable. Use structured JSON output (Ollama supports `format: "json"`). Have a strict schema; reject malformed responses.
- Context-window management: a 6000-message session can't fit. Truncate intelligently: keep first 5 + last 5 user prompts, sample the middle, include key signals (errors, reverts, "thanks").
- The grader is a model — it has its own biases. Document this explicitly. Recommend running the grader with multiple models and averaging if precision matters.
- Cost: grading is ~$0.001-0.01 per session at local Ollama (free) or ~$0.05-0.20 via cloud. Local-only by default.

## Out of scope
- Cloud-API grading by default (local Ollama only; cloud would be a future opt-in).
- Real-time grading (this is offline backfill).
- Cross-session "agent-skill ranking" (Spec 26's job).

## Dependencies
- Spec 21 (static analysis) is helpful context but not blocking.
- Consumed by Spec 26 (comparative benchmark) — combines static + LLM grades into the scoring rubric.

## Estimated effort
**Size M** — single agent, ~1-1.5 hr.

## Hard rules
- DO NOT touch versions / CHANGELOG headings.
- Pre-assigned schema slot: **v020**.
- Branch: `feat/llm-graded-quality` off main.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spec 23: LLM-graded session quality — offline rubric scoring of every session #95

Goal

Why now

Schema

User-visible surface

Implementation plan

Tests

Hard parts

Out of scope

Dependencies

Estimated effort

Hard rules

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Spec 23: LLM-graded session quality — offline rubric scoring of every session #95

Description

Goal

Why now

Schema

User-visible surface

Implementation plan

Tests

Hard parts

Out of scope

Dependencies

Estimated effort

Hard rules

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions