feat(eval-campaign): trust-gate primitive — IRR floor + per-item rater spread + named reasons#17
Merged
Add `trustVerdicts`, the code "Enforced by" for the measurement-validation
skill's after-gate ("is this result allowed to be believed"). Sibling to
`aggregateJudgeVerdicts`, one level up: that reducer folds ONE item's raters to
a composite; this audits the raters ACROSS items and reports whether the
composites are believable before a lift is reported.
Three fail-loud checks, each named in `trustReasons` with its number and value:
(1) corpus inter-rater reliability ≥ irrFloor (default 0.2), leaning on the
substrate's `interRaterReliability`,
(2) per-item rater spread ≤ spreadCeiling (default 0.5),
(3) surviving raters per item ≥ minSurvivors (default 3).
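The three checks can be sketched as a pure function over metrics that have already been computed. This is a minimal illustration, not the `trustVerdicts` implementation: the `gateSketch` name, the input shape, and the wording of the reason strings are assumptions. Only the thresholds, their defaults, and the empty-corpus throw come from this commit message.

```typescript
// Hypothetical input shape: metrics are assumed to be computed elsewhere.
type GateInputs = {
  interRaterReliability: number;
  perItem: { itemId: string; spread: number; survivors: number }[];
};

type Thresholds = { irrFloor: number; spreadCeiling: number; minSurvivors: number };

// Defaults as stated in the commit message.
const DEFAULT_THRESHOLDS: Thresholds = { irrFloor: 0.2, spreadCeiling: 0.5, minSurvivors: 3 };

function gateSketch(inputs: GateInputs, overrides: Partial<Thresholds> = {}) {
  // Fail loud: no verdict over zero evidence.
  if (inputs.perItem.length === 0) {
    throw new Error("empty corpus: nothing to audit");
  }
  const t: Thresholds = { ...DEFAULT_THRESHOLDS, ...overrides };
  const trustReasons: string[] = [];

  // (1) corpus-level reliability floor
  if (inputs.interRaterReliability < t.irrFloor) {
    trustReasons.push(`(1) IRR ${inputs.interRaterReliability} < irrFloor ${t.irrFloor}`);
  }
  for (const item of inputs.perItem) {
    // (2) within-item rater spread ceiling
    if (item.spread > t.spreadCeiling) {
      trustReasons.push(`(2) item ${item.itemId}: spread ${item.spread} > spreadCeiling ${t.spreadCeiling}`);
    }
    // (3) minimum surviving raters per item
    if (item.survivors < t.minSurvivors) {
      trustReasons.push(`(3) item ${item.itemId}: survivors ${item.survivors} < minSurvivors ${t.minSurvivors}`);
    }
  }
  // trustReasons is empty exactly when the result is trustworthy.
  return { trustworthy: trustReasons.length === 0, trustReasons };
}

const verdict = gateSketch({
  interRaterReliability: 0.1,
  perItem: [{ itemId: "q1", spread: 0.9, survivors: 2 }],
});
console.log(verdict.trustworthy, verdict.trustReasons.length);
```

In this sketch a single item that fails both per-item checks, on top of a low corpus IRR, yields three separate named reasons rather than one opaque failure.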
Per-item spread is within-item by construction (max − min across THAT item's
surviving raters, max over its dimensions), never pooled across items or sides —
pooling reads a genuine quality gap between items as "raters split" and trips
the gate exactly when the finding is largest. Empty corpus throws; a failed
judge is dropped, never folded as a zero.
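A small sketch of why the spread has to be computed within each item. The `RaterVerdict` shape and the `spreadOver` helper are hypothetical; only the `perDimension: null` marker for a failed judge and the max − min definition are taken from the PR text.

```typescript
// Hypothetical shape of one rater's verdict on one item.
type RaterVerdict = { judge: string; perDimension: Record<string, number> | null };

// Spread = max over dimensions of (max score - min score) across surviving raters.
function spreadOver(raters: RaterVerdict[]): number {
  const byDimension = new Map<string, number[]>();
  for (const rater of raters) {
    const scores = rater.perDimension;
    if (scores === null) continue; // failed judge: dropped, never counted as 0
    for (const dim of Object.keys(scores)) {
      const score = scores[dim];
      if (score === undefined) continue;
      const bucket = byDimension.get(dim) ?? [];
      bucket.push(score);
      byDimension.set(dim, bucket);
    }
  }
  let spread = 0;
  byDimension.forEach((bucket) => {
    spread = Math.max(spread, Math.max(...bucket) - Math.min(...bucket));
  });
  return spread;
}

// Raters agree on each item, but the items differ a lot in quality.
const strongItem: RaterVerdict[] = [
  { judge: "a", perDimension: { accuracy: 0.9, clarity: 0.9 } },
  { judge: "b", perDimension: { accuracy: 0.9, clarity: 0.9 } },
  { judge: "c", perDimension: { accuracy: 0.9, clarity: 0.9 } },
  { judge: "d", perDimension: null }, // failed judge
];
const weakItem: RaterVerdict[] = [
  { judge: "a", perDimension: { accuracy: 0.2, clarity: 0.2 } },
  { judge: "b", perDimension: { accuracy: 0.2, clarity: 0.2 } },
  { judge: "c", perDimension: { accuracy: 0.2, clarity: 0.2 } },
];

const withinStrong = spreadOver(strongItem); // raters agree, so no spread
const withinWeak = spreadOver(weakItem); // raters agree, so no spread
// Pooling both items mistakes the quality gap for rater disagreement.
const pooled = spreadOver([...strongItem, ...weakItem]);
console.log({ withinStrong, withinWeak, pooled });
```

Here both within-item spreads are zero, while the pooled figure is roughly the 0.7 gap between the items and would exceed the default `spreadCeiling` of 0.5. Folding judge `d` in as a zero would likewise report a spread of 0.9 on an item whose surviving raters fully agree.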
Exported from `@tangle-network/agent-app/eval-campaign`; additive, non-breaking.
What
Adds `trustVerdicts` to `@tangle-network/agent-app/eval-campaign`: the code "Enforced by" for the `measurement-validation` skill's after-gate ("is this result allowed to be believed"). Additive, non-breaking.
Sibling to `aggregateJudgeVerdicts`, one level up. That reducer folds one item's raters into a composite; `trustVerdicts` audits the raters across the corpus and decides whether the composites are believable before a lift is reported over them. Pure: no LLM, no I/O, no clock, no random.

The three checks
Each failed check is named in `trustReasons` with its number and the offending value; `trustReasons` is empty iff `trustworthy`.
(1) corpus inter-rater reliability ≥ `irrFloor` (leans on the substrate's `interRaterReliability`),
(2) per-item rater spread ≤ `spreadCeiling`,
(3) surviving raters per item ≥ `minSurvivors`.
Returns `{ trustworthy, trustReasons, interRaterReliability, perItemSpread }`. All thresholds overridable.

Metric semantics (the load-bearing part)
Per-item spread is within-item: `max(score) − min(score)` across THAT item's surviving raters (max over its dimensions), never pooled across different items or the baseline/candidate sides. Pooling reads a genuine quality gap between items as "the raters split" and so trips the gate exactly when the finding is largest, which is the failure mode the after-gate exists to prevent. The corpus IRR (check 1) uses the substrate's Krippendorff-style α, whose expected-disagreement denominator already pools across items, so genuine item-to-item variation raises reliability rather than lowering it.
Fail-loud: an empty corpus throws (no silent trust over zero evidence); a failed judge (`perDimension: null`) is dropped, never folded as a zero.

Skill wiring
One `Enforced by` line under invariant #2 of `.claude/skills/measurement-validation/SKILL.md`, pointing at the new export. No duplication of the skill's content.

Verification
- `pnpm typecheck`: clean
- `pnpm build`: clean (export present in `dist/eval-campaign/index.d.ts`)
- `pnpm test`: 188 passed / 0 failed (23 suites), 7 new deterministic trust-gate tests covering: agreeing raters over wide item-to-item differences → trustworthy with zero spread flags; raters split on one item → (2) names it; low IRR → (1) with values; few survivors → (3); failed-judge drop; all thresholds overridable; empty-corpus throw.
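The first test case (agreeing raters over wide item-to-item differences → trustworthy) follows from how a Krippendorff-style α is built. The sketch below uses the textbook interval-metric form, α = 1 − D_o / D_e, as an assumption; the substrate's `interRaterReliability` may differ in metric or edge-case handling, and `alphaIntervalSketch` is a hypothetical name.

```typescript
// Sum of squared differences over all ordered pairs (i != j) of values.
function sumSquaredPairDiffs(values: number[]): number {
  let sum = 0;
  values.forEach((a, i) => {
    values.forEach((b, j) => {
      if (i !== j) sum += (a - b) * (a - b);
    });
  });
  return sum;
}

// Each inner array holds the surviving raters' scores for one item.
function alphaIntervalSketch(items: number[][]): number {
  // Items with fewer than two ratings have no pairs and are skipped.
  const pairable = items.filter((scores) => scores.length >= 2);
  const all: number[] = [];
  for (const scores of pairable) all.push(...scores);
  const n = all.length;
  if (n < 2) throw new Error("need at least one item with two ratings");

  // Observed disagreement: pairs taken WITHIN each item only.
  let observed = 0;
  for (const scores of pairable) {
    observed += sumSquaredPairDiffs(scores) / (scores.length - 1);
  }
  observed /= n;

  // Expected disagreement: pairs pooled ACROSS all items.
  const expected = sumSquaredPairDiffs(all) / (n * (n - 1));

  // Convention chosen for this sketch: identical scores everywhere count as full agreement.
  if (expected === 0) return 1;
  return 1 - observed / expected;
}

// Same rater noise (±0.1) in both corpora; only the gap between items differs.
const wideGap = alphaIntervalSketch([
  [0.9, 1.0, 0.8],
  [0.1, 0.0, 0.2],
]); // ≈ 0.95
const noGap = alphaIntervalSketch([
  [0.5, 0.6, 0.4],
  [0.5, 0.4, 0.6],
]); // ≈ -0.25
console.log({ wideGap, noGap });
```

Because item-to-item variation enters only the pooled denominator, the corpus with the larger quality gap scores higher on α for the same rater noise, so under these assumptions it clears the default `irrFloor` of 0.2 while the flat corpus does not.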