feat(bench): EOPS analyst-steered depth (S1) — de-confound steer quality by drewstone · Pull Request #201 · tangle-network/agent-runtime

drewstone · 2026-06-08T23:20:08Z

What

This PR isolates steer quality as the variable in the EOPS depth-vs-breadth gate. The prior depth win (+13.4pp) used a trace-analyst on a different harness, so steer type, harness, and n were all confounded relative to the n=24 generic-steer result (−9.9pp). This run varies only the steerer: same eops-gate harness, same domain, same n=24, same breadth control.

- routerToolLoop now returns toolTrace (each tool call + result), which is what a trace-analyst reads: the agent's behavior, never the verdict.
- STEER=analyst wires an inline agent-eval-style trace-analyst as the depth steer: it reads the agent's tool-call trace, diagnoses the remaining gap, and issues one concrete corrective instruction. The analyst is firewalled and never sees the verifiers or expected values. STEER=generic keeps the fixed nudge.
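
The firewall described above can be sketched as a type-level constraint: the analyst receives only the rendered tool trace, so verifier output has no path into the steer. This is a minimal sketch under assumptions, not the code in this PR. `ToolCall`, `LoopResult`, `Steer`, `renderTrace`, `makeAnalystSteer`, `pickSteer`, and the nudge text are all hypothetical names, and the `diagnose` callback stands in for the LLM call (kept synchronous here for brevity).

```typescript
// Hypothetical shapes for illustration; the real routerToolLoop types may differ.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
  result: string;
}

interface LoopResult {
  answer: string;
  toolTrace: ToolCall[]; // each tool call plus its result, in order
}

// A steer maps one finished attempt to the instruction for the next attempt.
type Steer = (attempt: LoopResult) => string;

// Placeholder text; the actual fixed nudge is not shown in the PR description.
const GENERIC_NUDGE = "Re-check your work and fix anything incomplete.";

const genericSteer: Steer = () => GENERIC_NUDGE;

// Renders behavior only. The function takes the trace and nothing else,
// so verifier verdicts and expected values cannot reach the analyst.
function renderTrace(trace: ToolCall[]): string {
  return trace
    .map((c, i) => `${i + 1}. ${c.tool}(${JSON.stringify(c.args)}) -> ${c.result}`)
    .join("\n");
}

// `diagnose` stands in for the analyst model: trace text in, one instruction out.
function makeAnalystSteer(diagnose: (traceText: string) => string): Steer {
  return (attempt) => diagnose(renderTrace(attempt.toolTrace));
}

// Mirrors the STEER=analyst / STEER=generic switch; the mode is passed in
// explicitly rather than read from the environment.
function pickSteer(mode: string | undefined, diagnose: (traceText: string) => string): Steer {
  return mode === "analyst" ? makeAnalystSteer(diagnose) : genericSteer;
}
```

Passing the trace as the only argument keeps the firewall structural rather than a convention the prompt author has to remember.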

Result (n=24, gpt-4.1, paired bootstrap) — the clean de-confound

steerer	depth − breadth (score)	verdict
generic (S0)	−9.9pp	SIGNIF negative
agent-eval analyst (S1)	−7.2pp	n.s. (CI [−19.7, +4.6], disc 9)

A real inline analyst moves depth in the right direction (−9.9 → −7.2, and the gap is no longer statistically significant) but still does not beat breadth. So a good steerer helps directionally, but is not enough to flip depth>breadth on EOPS in this off-box harness. The prior +13.4pp does not replicate under a clean paired bootstrap, even with an analyst.
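
For readers unfamiliar with the statistic, a percentile paired bootstrap over per-task score differences can be sketched as follows. This is an illustration under assumptions, not the harness's implementation: it assumes the verdict is read from whether the 95% interval of the mean depth − breadth difference excludes zero, and `pairedBootstrap`, the iteration count, and the seeded `mulberry32` generator are choices made here for reproducibility.

```typescript
// Small deterministic PRNG (mulberry32) so resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDiff: number; // observed mean of (depth - breadth)
  ciLow: number; // 2.5th percentile of resampled means
  ciHigh: number; // 97.5th percentile of resampled means
  significant: boolean; // true when the interval excludes zero
}

// depth[i] and breadth[i] must be scores for the same task i (paired).
function pairedBootstrap(
  depth: number[],
  breadth: number[],
  iters: number = 10000,
  seed: number = 1,
): BootstrapResult {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    // Resample task-level differences with replacement, keeping pairs intact.
    let sum = 0;
    for (let j = 0; j < n; j++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const ciLow = means[Math.floor(0.025 * iters)];
  const ciHigh = means[Math.min(iters - 1, Math.floor(0.975 * iters))];
  const meanDiff = diffs.reduce((s, x) => s + x, 0) / n;
  return { meanDiff, ciLow, ciHigh, significant: ciLow > 0 || ciHigh < 0 };
}
```

Resampling the differences rather than the two arms separately is what makes the test paired: task difficulty cancels within each pair, which matters at n=24.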

What's next (the matrix, mapped to the repo's "analyst = 3 runtimes" F3)

The richest remaining steerer is HALO (the recursive/parallel fanout trace-analysis engine, on PATH) — the last shot at flipping it. It connects via the router; feeding it needs OpenInference/OTLP-shaped span traces (build pending). S3 = HALO + agent-eval combined.
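
As a rough idea of what "OpenInference/OTLP-shaped" could mean for the existing toolTrace, the sketch below maps each tool call to an OTLP-JSON-style span carrying the OpenInference attributes `openinference.span.kind`, `tool.name`, `input.value`, and `output.value`. Everything else is an assumption made for illustration: the `ToolCall` shape, the `toSpans` helper, the counter-based span ids, and the synthetic step-index timestamps (the trace as described carries no timing). What HALO actually requires is not specified in this PR.

```typescript
// Hypothetical trace entry; the real toolTrace element type may differ.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
  result: string;
}

// Subset of the OTLP JSON span encoding, enough for this sketch.
interface KeyValue {
  key: string;
  value: { stringValue: string };
}

interface SpanLike {
  traceId: string; // 32 hex chars, supplied by the caller
  spanId: string; // 16 hex chars
  name: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes: KeyValue[];
}

const attr = (key: string, v: string): KeyValue => ({ key, value: { stringValue: v } });

// Counter-based span id, left-padded to 16 hex chars (illustrative only).
const spanIdFor = (i: number): string => ("0000000000000000" + (i + 1).toString(16)).slice(-16);

function toSpans(traceId: string, trace: ToolCall[]): SpanLike[] {
  return trace.map((call, i) => ({
    traceId,
    spanId: spanIdFor(i),
    name: call.tool,
    // Synthetic ordering only: step i occupies the interval [i, i + 1).
    startTimeUnixNano: String(i),
    endTimeUnixNano: String(i + 1),
    attributes: [
      attr("openinference.span.kind", "TOOL"),
      attr("tool.name", call.tool),
      attr("input.value", JSON.stringify(call.args)),
      attr("output.value", call.result),
    ],
  }));
}
```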

Test

Typecheck is clean. STEER=analyst ran n=24 with 0 excluded against the live gym, and the analyst steer was confirmed firing.

The n=24 EOPS gate showed GENERIC-steered depth losing to breadth (-9.9pp). But the prior depth-WIN (+13.4pp) used a trace-analyst on a different harness, so steer type, harness, and n were all confounded. This isolates the steerer: same eops-gate harness, same domain, same n, same breadth control; only STEER varies.

- routerToolLoop now returns toolTrace (each tool call + result): what a trace analyst reads (behavior, never the verdict).
- STEER=analyst wires an inline agent-eval-style trace-analyst as the depth steer: it reads the agent's tool-call trace, diagnoses the remaining gap, and issues one concrete corrective instruction. FIREWALLED: never sees the verifiers/expected values. STEER=generic keeps the fixed nudge (the -9.9pp control).

The decisive comparison: depth@analyst vs breadth, vs depth@generic vs breadth. If the analyst flips depth from significantly-losing to winning where the generic steer lost (same harness/n/domain), steer QUALITY is the operative variable, not depth. Maps to the repo's "analyst = 3 runtimes" F3 (inline / Halo-cli / sandboxed-fanout).

drewstone merged commit 6d90502 into main Jun 8, 2026
1 check passed

drewstone deleted the feat/eops-analyst-steer branch June 8, 2026 23:21
