feat(eval-campaign): trust-gate primitive — IRR floor + per-item rater spread + named reasons by drewstone · Pull Request #17 · tangle-network/agent-app

drewstone · 2026-06-07T16:54:34Z

What

Adds trustVerdicts to @tangle-network/agent-app/eval-campaign: the code "Enforced by" for the measurement-validation skill's after-gate ("is this result allowed to be believed"). Additive, non-breaking.

Sibling to aggregateJudgeVerdicts, one level up. That reducer folds one item's raters into a composite; trustVerdicts audits the raters across the corpus and decides whether the composites are believable before a lift is reported over them. Pure: no LLM, no I/O, no clock, no random.

The three checks

Each failed check is named in trustReasons with its number and the offending value; trustReasons is empty iff trustworthy.

| # | Check | Default |
|---|-------|---------|
| (1) | corpus inter-rater reliability ≥ `irrFloor` (leans on the substrate's `interRaterReliability`) | 0.2 |
| (2) | per-item rater spread ≤ `spreadCeiling` | 0.5 |
| (3) | surviving raters per item ≥ `minSurvivors` | 3 |

Returns { trustworthy, trustReasons, interRaterReliability, perItemSpread }. All thresholds overridable.
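
As a rough illustration of how the three checks and the return shape fit together, here is a minimal sketch. It is not the shipped implementation: the input types (`RaterVerdict`, `CorpusItem`), the function name `trustGateSketch`, and the choice to pass the corpus IRR in as a precomputed number are all assumptions made for this example. Only the threshold names, their defaults, and the four returned fields come from the description above.

```typescript
// Hypothetical input shapes; the real types in eval-campaign may differ.
type RaterVerdict = { perDimension: Record<string, number> | null };
type CorpusItem = { id: string; raters: RaterVerdict[] };
type TrustOptions = { irrFloor?: number; spreadCeiling?: number; minSurvivors?: number };
type TrustResult = {
  trustworthy: boolean;
  trustReasons: string[];
  interRaterReliability: number;
  perItemSpread: Record<string, number>;
};

// max(score) - min(score) per dimension, then the max over dimensions.
function withinItemSpread(survivors: Array<Record<string, number>>): number {
  const byDimension: Record<string, number[]> = {};
  for (const scores of survivors) {
    for (const dim of Object.keys(scores)) {
      const value = scores[dim];
      if (typeof value !== "number") continue;
      const bucket = byDimension[dim] ?? [];
      bucket.push(value);
      byDimension[dim] = bucket;
    }
  }
  let spread = 0;
  for (const dim of Object.keys(byDimension)) {
    const values = byDimension[dim] ?? [];
    if (values.length === 0) continue;
    spread = Math.max(spread, Math.max(...values) - Math.min(...values));
  }
  return spread;
}

// Assumption: `irr` is computed elsewhere and handed in as a number.
function trustGateSketch(items: CorpusItem[], irr: number, opts: TrustOptions = {}): TrustResult {
  const { irrFloor = 0.2, spreadCeiling = 0.5, minSurvivors = 3 } = opts;
  if (items.length === 0) throw new Error("trust gate: empty corpus");

  const trustReasons: string[] = [];
  const perItemSpread: Record<string, number> = {};

  // Check (1): corpus-level reliability floor.
  if (irr < irrFloor) trustReasons.push(`(1) IRR ${irr} < irrFloor ${irrFloor}`);

  for (const item of items) {
    // A failed judge (perDimension: null) is dropped, not scored as zero.
    const survivors: Array<Record<string, number>> = [];
    for (const rater of item.raters) {
      if (rater.perDimension !== null) survivors.push(rater.perDimension);
    }
    const spread = withinItemSpread(survivors);
    perItemSpread[item.id] = spread;
    // Check (2): this item's own raters disagree too much.
    if (spread > spreadCeiling) {
      trustReasons.push(`(2) item ${item.id}: spread ${spread} > spreadCeiling ${spreadCeiling}`);
    }
    // Check (3): too few raters survived for this item.
    if (survivors.length < minSurvivors) {
      trustReasons.push(`(3) item ${item.id}: ${survivors.length} survivors < minSurvivors ${minSurvivors}`);
    }
  }

  return { trustworthy: trustReasons.length === 0, trustReasons, interRaterReliability: irr, perItemSpread };
}
```

Because every failed check pushes exactly one named reason, `trustworthy` is derived from `trustReasons.length === 0`, which keeps the "empty iff trustworthy" property true by construction in this sketch.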

Metric semantics (the load-bearing part)

Per-item spread is within-item: max(score) − min(score) across THAT item's surviving raters (max over its dimensions), never pooled across different items or the baseline/candidate sides. Pooling reads a genuine quality gap between items as "the raters split" and so trips the gate exactly when the finding is largest — the failure mode the after-gate exists to prevent. The corpus IRR (check 1) uses the substrate's Krippendorff-style α, whose expected-disagreement denominator already pools across items, so genuine item-to-item variation raises reliability rather than lowering it.
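
The difference between within-item and pooled spread is easiest to see on a toy corpus. The sketch below uses made-up scores for a single dimension: the raters agree perfectly on each item, but the two items differ a lot in quality. The item ids, the scores, and the helper `range` are illustrative only.

```typescript
// Two items, one dimension, three raters each. Raters agree within each item.
const toyItems = [
  { id: "weak-answer", scores: [0.25, 0.25, 0.25] },
  { id: "strong-answer", scores: [1, 1, 1] },
];

const range = (xs: number[]): number => Math.max(...xs) - Math.min(...xs);

// Within-item: one spread per item, computed only from that item's raters.
const withinItem = toyItems.map((item) => ({ id: item.id, spread: range(item.scores) }));

// Pooled (the mistake): every score from every item thrown into one bag.
const pooled = range(toyItems.reduce<number[]>((all, item) => all.concat(item.scores), []));

console.log(withinItem); // both spreads are 0: the raters never disagreed
console.log(pooled); // 0.75: a quality gap between items, misread as rater disagreement
```

With the default `spreadCeiling` of 0.5, the pooled value would trip check (2) even though no rater disagreed with any other, while the within-item values pass.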

Fail-loud: an empty corpus throws (no silent trust over zero evidence); a failed judge (perDimension: null) is dropped, never folded as a zero.
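
To show why a failed judge is dropped rather than folded in as a zero, here is a small sketch. The `JudgeResult` shape and the helper names `survivingScores`, `foldedAsZero`, and `assertNonEmptyCorpus` are hypothetical; only the `perDimension: null` convention and the throw on an empty corpus come from the description above.

```typescript
// Hypothetical shape: a judge that failed to produce scores has perDimension: null.
type JudgeResult = { perDimension: Record<string, number> | null };

// Fail-loud behaviour: failed judges are skipped entirely.
function survivingScores(judges: JudgeResult[], dimension: string): number[] {
  const scores: number[] = [];
  for (const judge of judges) {
    if (judge.perDimension === null) continue; // dropped, not scored
    const value = judge.perDimension[dimension];
    if (typeof value === "number") scores.push(value);
  }
  return scores;
}

// The behaviour being avoided: a failure silently becomes a score of 0.
function foldedAsZero(judges: JudgeResult[], dimension: string): number[] {
  return judges.map((judge) => (judge.perDimension === null ? 0 : judge.perDimension[dimension] ?? 0));
}

// An empty corpus is an error, never a silent pass.
function assertNonEmptyCorpus<T>(items: T[]): void {
  if (items.length === 0) {
    throw new Error("empty corpus: refusing to report trust over zero evidence");
  }
}

const judges: JudgeResult[] = [
  { perDimension: { quality: 0.75 } },
  { perDimension: { quality: 0.75 } },
  { perDimension: null }, // this judge failed
];

const spreadOf = (xs: number[]): number => Math.max(...xs) - Math.min(...xs);

console.log(spreadOf(survivingScores(judges, "quality"))); // 0: the two survivors agree
console.log(spreadOf(foldedAsZero(judges, "quality"))); // 0.75: the failure looks like a dissenting rater
```

Dropping the failed judge leaves two survivors, which check (3) can then flag against `minSurvivors` on its own terms instead of the failure showing up as a spurious spread.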

Skill wiring

One "Enforced by" line under invariant #2 of .claude/skills/measurement-validation/SKILL.md, pointing at the new export. No duplication of the skill's content.

Verification

pnpm typecheck — clean
pnpm build — clean (export present in dist/eval-campaign/index.d.ts)
pnpm test — 188 passed / 0 failed (23 suites), 7 new deterministic trust-gate tests covering: agreeing raters over wide item-to-item differences → trustworthy with zero spread flags; raters split on one item → (2) names it; low IRR → (1) with values; few survivors → (3); failed-judge drop; all thresholds overridable; empty-corpus throw.

…r spread + named reasons

Add `trustVerdicts`, the code "Enforced by" for the measurement-validation skill's after-gate ("is this result allowed to be believed"). Sibling to `aggregateJudgeVerdicts`, one level up: that reducer folds ONE item's raters to a composite; this audits the raters ACROSS items and reports whether the composites are believable before a lift is reported.

Three fail-loud checks, each named in `trustReasons` with its number and value:
(1) corpus inter-rater reliability ≥ irrFloor (default 0.2), leaning on the substrate's `interRaterReliability`,
(2) per-item rater spread ≤ spreadCeiling (default 0.5),
(3) surviving raters per item ≥ minSurvivors (default 3).

Per-item spread is within-item by construction (max − min across THAT item's surviving raters, max over its dimensions), never pooled across items or sides: pooling reads a genuine quality gap between items as "raters split" and trips the gate exactly when the finding is largest. Empty corpus throws; a failed judge is dropped, never folded as a zero.

Exported from `@tangle-network/agent-app/eval-campaign`; additive, non-breaking.

drewstone merged commit c233a12 into main Jun 7, 2026
1 check passed

drewstone mentioned this pull request Jun 7, 2026

chore(release): agent-app 0.3.0 — trust gate + skills shipped in package #18

Merged

drewstone merged 1 commit into main from feat/eval-campaign-trust-gate
