
feat(eval-campaign): trust-gate primitive — IRR floor + per-item rater spread + named reasons#17

Merged: drewstone merged 1 commit into main from feat/eval-campaign-trust-gate on Jun 7, 2026



@drewstone

Contributor

What

Adds trustVerdicts to @tangle-network/agent-app/eval-campaign: the code "Enforced by" for the measurement-validation skill's after-gate ("is this result allowed to be believed"). Additive, non-breaking.

Sibling to aggregateJudgeVerdicts, one level up. That reducer folds one item's raters into a composite; trustVerdicts audits the raters across the corpus and decides whether those composites are believable before a lift is reported over them. Pure: no LLM calls, no I/O, no clock, no randomness.

The three checks

Each failed check is named in trustReasons with its number and the offending value; trustReasons is empty iff trustworthy.

| # | Check | Default |
|---|---|---|
| (1) | corpus inter-rater reliability ≥ irrFloor (leans on the substrate's interRaterReliability) | 0.2 |
| (2) | per-item rater spread ≤ spreadCeiling | 0.5 |
| (3) | surviving raters per item ≥ minSurvivors | 3 |

Returns { trustworthy, trustReasons, interRaterReliability, perItemSpread }. All thresholds overridable.
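
A minimal sketch of how the three checks could fold into the returned shape. Only the result field names, the option names, and the defaults come from the description above. The `ItemSummary` input, the `sketchTrustGate` name, and the reason-string format are hypothetical, and the corpus IRR is passed in as a precomputed number instead of calling the substrate.

```typescript
// Hypothetical per-item summary computed upstream; not the package's real input type.
interface ItemSummary {
  itemId: string;
  spread: number;    // within-item max - min across surviving raters
  survivors: number; // raters whose verdict was not dropped
}

// Option names and defaults are taken from the PR description.
interface GateOptions {
  irrFloor?: number;
  spreadCeiling?: number;
  minSurvivors?: number;
}

// Field names are taken from the PR description; the value types are assumed.
interface GateResult {
  trustworthy: boolean;
  trustReasons: string[];
  interRaterReliability: number;
  perItemSpread: Record<string, number>;
}

function sketchTrustGate(
  irr: number,
  items: ItemSummary[],
  opts: GateOptions = {}
): GateResult {
  if (items.length === 0) {
    throw new Error("empty corpus: no evidence to trust");
  }
  const irrFloor = opts.irrFloor ?? 0.2;
  const spreadCeiling = opts.spreadCeiling ?? 0.5;
  const minSurvivors = opts.minSurvivors ?? 3;

  const trustReasons: string[] = [];
  const perItemSpread: Record<string, number> = {};

  // Check (1): corpus-level reliability floor.
  if (irr < irrFloor) {
    trustReasons.push(`(1) corpus IRR ${irr} < irrFloor ${irrFloor}`);
  }
  for (const item of items) {
    perItemSpread[item.itemId] = item.spread;
    // Check (2): per-item rater spread ceiling.
    if (item.spread > spreadCeiling) {
      trustReasons.push(`(2) item ${item.itemId}: spread ${item.spread} > spreadCeiling ${spreadCeiling}`);
    }
    // Check (3): enough surviving raters on this item.
    if (item.survivors < minSurvivors) {
      trustReasons.push(`(3) item ${item.itemId}: survivors ${item.survivors} < minSurvivors ${minSurvivors}`);
    }
  }
  // trustReasons is empty iff the result is trustworthy.
  return { trustworthy: trustReasons.length === 0, trustReasons, interRaterReliability: irr, perItemSpread };
}

const ok = sketchTrustGate(0.8, [{ itemId: "a", spread: 0.1, survivors: 3 }]);
const bad = sketchTrustGate(0.1, [{ itemId: "a", spread: 0.7, survivors: 2 }]);
console.log(ok.trustworthy, bad.trustReasons);
```

In this sketch `bad` fails all three checks at once, so a caller sees every named reason in a single pass instead of fixing them one at a time.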

Metric semantics (the load-bearing part)

Per-item spread is within-item: max(score) − min(score) across THAT item's surviving raters (max over its dimensions), never pooled across different items or the baseline/candidate sides. Pooling reads a genuine quality gap between items as "the raters split" and so trips the gate exactly when the finding is largest — the failure mode the after-gate exists to prevent. The corpus IRR (check 1) uses the substrate's Krippendorff-style α, whose expected-disagreement denominator already pools across items, so genuine item-to-item variation raises reliability rather than lowering it.
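
The two metrics above can be sketched as follows. This is an illustration under assumptions, not the package's code: `withinItemSpread` reads "max over its dimensions" as the largest per-dimension range, the `DimensionScores` shape is hypothetical, and `intervalAlpha` is the textbook interval-metric Krippendorff α over a single dimension, standing in for the substrate's Krippendorff-style `interRaterReliability`.

```typescript
// Hypothetical shape: one surviving rater's scores for one item, keyed by dimension.
type DimensionScores = Record<string, number>;

// Within-item spread: per dimension, max - min across this item's raters;
// the item's spread is the largest of those per-dimension ranges.
function withinItemSpread(raters: DimensionScores[]): number {
  const dims: Record<string, true> = {};
  for (const r of raters) for (const d of Object.keys(r)) dims[d] = true;
  let spread = 0;
  for (const d of Object.keys(dims)) {
    const scores = raters.filter((r) => d in r).map((r) => r[d]);
    spread = Math.max(spread, Math.max(...scores) - Math.min(...scores));
  }
  return spread;
}

// Textbook interval-metric Krippendorff alpha: units[i] holds the scores
// the raters gave item i on one dimension. Returns NaN if every score is identical.
function intervalAlpha(units: number[][]): number {
  const pairable = units.filter((u) => u.length >= 2);
  const all = ([] as number[]).concat(...pairable);
  const n = all.length;
  // Sum of squared differences over all ordered pairs of positions.
  const pairSum = (xs: number[]): number => {
    let s = 0;
    for (let i = 0; i < xs.length; i++) {
      for (let j = 0; j < xs.length; j++) {
        if (i !== j) s += (xs[i] - xs[j]) * (xs[i] - xs[j]);
      }
    }
    return s;
  };
  // Observed disagreement: within-item pairs only.
  let observed = 0;
  for (const u of pairable) observed += pairSum(u) / (u.length - 1);
  observed /= n;
  // Expected disagreement: pairs pooled across all items.
  const expected = pairSum(all) / (n * (n - 1));
  return 1 - observed / expected;
}

// Raters agree perfectly on each item, but the items differ a lot in quality.
const itemA: DimensionScores[] = [{ quality: 1 }, { quality: 1 }, { quality: 1 }];
const itemB: DimensionScores[] = [{ quality: 0.25 }, { quality: 0.25 }, { quality: 0.25 }];
const spreadA = withinItemSpread(itemA);                     // 0
const spreadB = withinItemSpread(itemB);                     // 0
const pooledSpread = withinItemSpread(itemA.concat(itemB));  // 0.75, a false "split"

// Same within-item disagreement (0.25) in both corpora; only item-to-item variation differs.
const alphaFlat = intervalAlpha([[0.5, 0.75], [0.5, 0.75]]); // roughly -0.5
const alphaVaried = intervalAlpha([[0, 0.25], [0.75, 1]]);   // roughly 0.85
console.log(spreadA, spreadB, pooledSpread, alphaFlat, alphaVaried);
```

With the default spreadCeiling of 0.5, the pooled figure would trip check (2) even though no rater disagrees with another on any item, while the within-item figures stay at zero. The two α values show the opposite effect for check (1): spreading the items apart raises α because only the pooled denominator grows.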

Fail-loud: an empty corpus throws (no silent trust over zero evidence); a failed judge (perDimension: null) is dropped, never folded as a zero.
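
A short sketch of the two fail-loud rules, assuming a hypothetical `JudgeVerdict` shape in which only the `perDimension: null` marker for a failed judge comes from the description. The helper names and the error message are illustrative.

```typescript
// Hypothetical verdict shape; a failed judge carries perDimension: null.
interface JudgeVerdict {
  raterId: string;
  perDimension: Record<string, number> | null;
}

interface SurvivingVerdict {
  raterId: string;
  perDimension: Record<string, number>;
}

// Drop failed judges entirely instead of scoring them as zero.
function survivingRaters(verdicts: JudgeVerdict[]): SurvivingVerdict[] {
  const out: SurvivingVerdict[] = [];
  for (const v of verdicts) {
    if (v.perDimension !== null) {
      out.push({ raterId: v.raterId, perDimension: v.perDimension });
    }
  }
  return out;
}

// Refuse to report trust over zero evidence.
function requireNonEmptyCorpus<T>(items: T[]): T[] {
  if (items.length === 0) {
    throw new Error("trust gate: empty corpus");
  }
  return items;
}

const verdicts: JudgeVerdict[] = [
  { raterId: "r1", perDimension: { quality: 0.75 } },
  { raterId: "r2", perDimension: { quality: 0.75 } },
  { raterId: "r3", perDimension: null }, // failed judge
];

const survivors = survivingRaters(verdicts);
const droppedScores = survivors.map((v) => v.perDimension.quality);
const spreadDropped = Math.max(...droppedScores) - Math.min(...droppedScores); // 0

// What folding the failure as a zero would do instead: a spurious 0.75 spread.
const foldedScores = verdicts.map((v) => (v.perDimension === null ? 0 : v.perDimension.quality));
const spreadFolded = Math.max(...foldedScores) - Math.min(...foldedScores); // 0.75
console.log(survivors.length, spreadDropped, spreadFolded);
```

Dropping the failed judge leaves two survivors with zero spread, so under the default minSurvivors of 3 the item would be flagged for the right reason, check (3), and not for a disagreement that never happened.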

Skill wiring

One "Enforced by" line under invariant #2 of .claude/skills/measurement-validation/SKILL.md, pointing at the new export. No duplication of the skill's content.

Verification

  • pnpm typecheck — clean
  • pnpm build — clean (export present in dist/eval-campaign/index.d.ts)
  • pnpm test — 188 passed / 0 failed (23 suites), 7 new deterministic trust-gate tests covering: agreeing raters over wide item-to-item differences → trustworthy with zero spread flags; raters split on one item → (2) names it; low IRR → (1) with values; few survivors → (3); failed-judge drop; all thresholds overridable; empty-corpus throw.

…r spread + named reasons

Add `trustVerdicts`, the code "Enforced by" for the measurement-validation
skill's after-gate ("is this result allowed to be believed"). Sibling to
`aggregateJudgeVerdicts`, one level up: that reducer folds ONE item's raters to
a composite; this audits the raters ACROSS items and reports whether the
composites are believable before a lift is reported.

Three fail-loud checks, each named in `trustReasons` with its number and value:
  (1) corpus inter-rater reliability ≥ irrFloor (default 0.2), leaning on the
      substrate's `interRaterReliability`,
  (2) per-item rater spread ≤ spreadCeiling (default 0.5),
  (3) surviving raters per item ≥ minSurvivors (default 3).

Per-item spread is within-item by construction (max − min across THAT item's
surviving raters, max over its dimensions), never pooled across items or sides —
pooling reads a genuine quality gap between items as "raters split" and trips
the gate exactly when the finding is largest. Empty corpus throws; a failed
judge is dropped, never folded as a zero.

Exported from `@tangle-network/agent-app/eval-campaign`; additive, non-breaking.

drewstone merged commit c233a12 into main on Jun 7, 2026
1 check passed