feat(bench): EOPS analyst-steered depth (S1) - de-confound steer quality #201
Merged
The n=24 EOPS gate showed GENERIC-steered depth losing to breadth (-9.9pp). But the prior depth win (+13.4pp) used a trace-analyst on a different harness, so steer type, harness, and n were all confounded. This isolates the steerer: same eops-gate harness, same domain, same n, same breadth control; only STEER varies.

- routerToolLoop now returns toolTrace (each tool call + result), which is what a trace analyst reads (behavior, never the verdict).
- STEER=analyst wires an inline agent-eval-style trace-analyst as the depth steer: it reads the agent's tool-call trace, diagnoses the remaining gap, and issues one concrete corrective instruction. FIREWALLED: it never sees the verifiers/expected values. STEER=generic keeps the fixed nudge (the -9.9pp control).

The decisive comparison: depth@analyst vs breadth, against depth@generic vs breadth. If the analyst flips depth from significantly losing to winning where the generic steer lost (same harness/n/domain), steer QUALITY is the operative variable, not depth. Maps to the repo's "analyst = 3 runtimes" F3 (inline / Halo-cli / sandboxed-fanout).
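The trace shape and the firewall described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `ToolTraceEntry` fields, `buildAnalystPrompt`, `nextSteer`, the `Analyst` callback, and the `GENERIC_NUDGE` text are all hypothetical names chosen for the example. The one property it tries to show is that the analyst path takes only the task and the tool trace as input, so verifier output and expected values have no way in.

```typescript
// Hypothetical shape: the PR only states that toolTrace holds each tool call + result.
interface ToolTraceEntry {
  tool: string;                  // name of the tool the agent invoked
  args: Record<string, unknown>; // arguments passed to the tool
  result: string;                // tool output as the agent observed it
}

type SteerMode = "generic" | "analyst";

// Placeholder text; the actual fixed nudge is not shown in the PR.
const GENERIC_NUDGE = "Re-check your work and continue until the task is complete.";

// Renders the trace into the analyst's prompt. The firewall is structural:
// no verifier output and no expected values are accepted as parameters.
function buildAnalystPrompt(task: string, trace: ToolTraceEntry[]): string {
  const lines = trace.map(
    (e, i) => `${i + 1}. ${e.tool}(${JSON.stringify(e.args)}) -> ${e.result}`
  );
  return [
    `Task: ${task}`,
    "Tool-call trace:",
    ...lines,
    "Diagnose the remaining gap and reply with ONE concrete corrective instruction.",
  ].join("\n");
}

// The analyst model call is injected so the sketch stays model-agnostic.
type Analyst = (prompt: string) => Promise<string>;

// Returns the steer for the next depth round: fixed nudge or analyst instruction.
async function nextSteer(
  mode: SteerMode,
  task: string,
  trace: ToolTraceEntry[],
  analyst: Analyst
): Promise<string> {
  if (mode === "generic") return GENERIC_NUDGE;
  return analyst(buildAnalystPrompt(task, trace));
}
```

Injecting the analyst as a function keeps the generic and analyst arms on the same code path, so the steer is the only thing that differs between them.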
What
Isolates steer quality as the variable in the EOPS depth-vs-breadth gate. The prior depth-win (+13.4pp) used a trace-analyst on a different harness, so steer type, harness, and n were all confounded vs the n=24 generic-steer result (−9.9pp). This varies only the steerer: same eops-gate harness, same domain, same n=24, same breadth control.

- routerToolLoop now returns toolTrace (each tool call + result), which is what a trace-analyst reads (behavior, never the verdict).
- STEER=analyst wires an inline agent-eval-style trace-analyst as the depth steer: it reads the agent's tool-call trace, diagnoses the remaining gap, and issues one concrete corrective instruction. Firewalled: it never sees the verifiers/expected values. STEER=generic keeps the fixed nudge.

Result (n=24, gpt-4.1, paired bootstrap): the clean de-confound
A real inline analyst moves depth in the right direction (−9.9 → −7.2, no longer statistically significant) but still does not beat breadth. So a good steerer helps directionally, but it is not enough to flip depth > breadth on EOPS in this off-box harness. The prior +13.4pp does not replicate under a clean paired bootstrap, even with an analyst.
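For readers unfamiliar with the test, a paired bootstrap over per-task score differences can be sketched as below. The PR does not show its bootstrap code, so this is a generic percentile version under stated assumptions: `pairedBootstrap`, `BootstrapResult`, the 95% interval, the iteration count, and the seeded `mulberry32` generator are all choices made for this example, and the harness may differ on any of them.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDiff: number;     // mean of (depth - breadth) over tasks
  ciLow: number;        // 2.5th percentile of the resampled means
  ciHigh: number;       // 97.5th percentile of the resampled means
  significant: boolean; // true when the 95% interval excludes 0
}

// depth[i] and breadth[i] must be scores for the SAME task i (the pairing).
function pairedBootstrap(
  depth: number[],
  breadth: number[],
  iters = 10000,
  seed = 1
): BootstrapResult {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("need two non-empty score arrays of equal length");
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    // Resample task indices with replacement; each draw keeps the pair intact.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const ciLow = means[Math.floor(0.025 * iters)];
  const ciHigh = means[Math.min(iters - 1, Math.floor(0.975 * iters))];
  const meanDiff = diffs.reduce((s, x) => s + x, 0) / n;
  return { meanDiff, ciLow, ciHigh, significant: ciLow > 0 || ciHigh < 0 };
}
```

Resampling the differences rather than each arm separately is what makes the test paired: task difficulty cancels within each pair, which matters at n=24.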
What's next (the matrix, mapped to the repo's "analyst = 3 runtimes" F3)
The richest remaining steerer is HALO (the recursive/parallel fanout trace-analysis engine, on PATH) — the last shot at flipping it. It connects via the router; feeding it needs OpenInference/OTLP-shaped span traces (build pending). S3 = HALO + agent-eval combined.
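One possible mapping from tool-trace entries to OTLP-JSON-style spans is sketched below. It is purely illustrative, since the real converter is still pending: `TraceStep`, its `startMs`/`endMs` timing fields, and `toSpans` are hypothetical, and the attribute keys (`openinference.span.kind`, `tool.name`, `input.value`, `output.value`) reflect my reading of the OpenInference semantic conventions and should be checked against the spec and against whatever HALO actually ingests.

```typescript
// Hypothetical input: one tool call with assumed millisecond timestamps.
interface TraceStep {
  tool: string;
  args: Record<string, unknown>;
  result: string;
  startMs: number; // assumed field, not stated in the PR
  endMs: number;   // assumed field, not stated in the PR
}

// Minimal subset of the OTLP JSON span shape (string attributes only).
interface OtlpAttribute {
  key: string;
  value: { stringValue: string };
}

interface OtlpSpan {
  traceId: string;           // 32 hex characters
  spanId: string;            // 16 hex characters
  name: string;
  startTimeUnixNano: string; // 64-bit integers travel as strings in OTLP JSON
  endTimeUnixNano: string;
  attributes: OtlpAttribute[];
}

const attr = (key: string, v: string): OtlpAttribute => ({
  key,
  value: { stringValue: v },
});

// Deterministic 16-hex-char span id derived from the step index.
const spanIdFor = (i: number): string =>
  ("0000000000000000" + (i + 1).toString(16)).slice(-16);

// Whole milliseconds to a nanosecond string without BigInt.
const msToNano = (ms: number): string => String(Math.round(ms)) + "000000";

function toSpans(steps: TraceStep[], traceId: string): OtlpSpan[] {
  return steps.map((s, i) => ({
    traceId,
    spanId: spanIdFor(i),
    name: s.tool,
    startTimeUnixNano: msToNano(s.startMs),
    endTimeUnixNano: msToNano(s.endMs),
    attributes: [
      attr("openinference.span.kind", "TOOL"),
      attr("tool.name", s.tool),
      attr("input.value", JSON.stringify(s.args)),
      attr("output.value", s.result),
    ],
  }));
}
```

Only behavior (calls and results) is mapped, so the same firewall holds: nothing from the verifiers would enter the spans.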
Test
typecheck clean; STEER=analyst ran n=24 0-excluded against the live gym; analyst steer confirmed firing.