fix(bench): EOPS depth scored on best checkpoint, not final state (autopsy reverses the result) by drewstone · Pull Request #202 · tangle-network/agent-runtime

drewstone · 2026-06-08T23:32:39Z

Autopsy finding

/autopsy of the "depth loses to breadth on EOPS" result (.evolve/autopsies/2026-06-08-eops-depth-breadth.md). Root cause: a design flaw in my own harness. The comparison was asymmetric:

breadth = best-of-K (max over K independent shots, verifier-selected)
depth = final state after K sequential shots

So depth alone paid for late-shot degradation — a steer that makes the model re-touch the DB and undo correct work. Ground-truth artifacts showed the signature: depth ending 0/N on tasks breadth solved (2/2→0, 5/7→0).
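
To make the asymmetry concrete, here is a minimal sketch of the two scoring rules as described above. The function names and the per-shot scores are hypothetical, made up for illustration; they are not taken from the harness.

```python
from typing import Sequence


def breadth_score(shot_scores: Sequence[float]) -> float:
    """Best-of-K: the verifier selects the best of K independent shots."""
    return max(shot_scores)


def depth_final_score(checkpoint_scores: Sequence[float]) -> float:
    """Old depth metric: only the state after the last sequential shot counts."""
    return checkpoint_scores[-1]


# Hypothetical verifier scores for one task (fraction of checks passed).
breadth_shots = [0.0, 1.0, 0.5]  # K independent attempts
depth_shots = [0.5, 1.0, 0.0]    # K sequential attempts; the last steer undoes correct work

print(breadth_score(breadth_shots))    # 1.0
print(depth_final_score(depth_shots))  # 0.0
```

Both arms reached a fully correct state at some point, but only breadth gets credit for it. Depth is scored on wherever it happened to stop.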

Fix

Score the DB state after every depth shot; report depth-BEST (max checkpoint, symmetric with breadth's best-of-K) alongside depth-FINAL. Checkpointing is deployable — snapshot the artifact, keep the best-verifying state (exactly what breadth does across its K).
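
A minimal sketch of the keep-best checkpoint loop, assuming the artifact can be deep-copied and that a verifier returns a numeric score. `run_depth_with_checkpoints`, `DepthResult`, `step`, and `verify` are hypothetical names for illustration, not the harness's actual API.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class DepthResult:
    best_score: float        # depth-BEST: max over all checkpoints
    final_score: float       # depth-FINAL: score after the last shot
    best_state: Any          # snapshot of the best-verifying state
    trajectory: List[float]  # score after every shot

    @property
    def degradation(self) -> float:
        """best - final: how much late shots undid earlier progress."""
        return self.best_score - self.final_score


def run_depth_with_checkpoints(
    initial_state: Any,
    step: Callable[[Any, int], Any],
    verify: Callable[[Any], float],
    k: int,
) -> DepthResult:
    if k < 1:
        raise ValueError("k must be at least 1")
    state = copy.deepcopy(initial_state)
    best_state, best_score = None, float("-inf")
    trajectory: List[float] = []
    for shot in range(k):
        state = step(state, shot)  # one sequential (steered) shot
        score = verify(state)      # score the state after EVERY shot
        trajectory.append(score)
        if score > best_score:     # strict '>' keeps the earliest best checkpoint
            best_score = score
            best_state = copy.deepcopy(state)
    return DepthResult(best_score, trajectory[-1], best_state, trajectory)


# Toy stand-ins (hypothetical): two rows to fix; the third shot undoes both.
def toy_step(state: dict, shot: int) -> dict:
    new = dict(state)
    if shot == 0:
        new["a"] = True
    elif shot == 1:
        new["b"] = True
    else:
        new["a"] = new["b"] = False
    return new


def toy_verify(state: dict) -> float:
    return sum(state.values()) / len(state)


result = run_depth_with_checkpoints({"a": False, "b": False}, toy_step, toy_verify, k=3)
print(result.trajectory, result.best_score, result.final_score, result.degradation)
```

In the toy run the trajectory is 0.5, 1.0, 0.0, so depth-BEST is 1.0 while depth-FINAL is 0.0. Returning `best_state` rather than the last state is what makes the policy deployable, since the caller can restore that snapshot.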

The result reverses (S0 generic, n=24, gpt-4.1, paired bootstrap)

metric	depth − breadth
depth-FINAL (the old, biased number)	−0.1pp n.s. (the prior −9.9pp was noise + degradation)
depth-BEST − breadth, score	+6.0pp CI [−0.4, +13.1]
depth-BEST − breadth, resolved	+12.5pp CI [0.0, +25.0]
degradation = best − final	+6.2pp (steering reached better states, then undid them)
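
For readers unfamiliar with the paired bootstrap, the sketch below shows one common variant: resample per-task differences with replacement and take percentile bounds on the mean. The PR does not say which interval type or resample count the harness uses, so the percentile method, `n_resamples`, and the function name are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    depth: Sequence[float],
    breadth: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for depth - breadth.

    Pairing means both arms are scored on the same tasks, so the per-task
    difference is the unit that gets resampled.
    """
    if len(depth) != len(breadth) or len(depth) == 0:
        raise ValueError("depth and breadth must be non-empty and the same length")
    diffs = [d - b for d, b in zip(depth, breadth)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw n tasks with replacement and average their differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi
```

An interval that includes 0, like the score row above, means the sign of the difference is not firmly established at that confidence level.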

Within-run steering does NOT lose on EOPS. Depth beats breadth even with a generic steer, once scored fairly. The earlier "steering loses everywhere" conclusion was a scoring artifact and should not drive strategy.

The HumanEval gates used best-of-K on both arms (no asymmetry), so those results stand — the boundary is intact: breadth wins on stateless codegen, depth(+checkpoint) wins on stateful agentic ops. Two independent engineering takeaways: (1) always checkpoint the artifact and keep the best-verifying state — final-state scoring undersells steering; (2) steering can actively degrade, so the keep-best policy is load-bearing, not cosmetic.

Test

typecheck clean; instrumented n=24 ran 0-excluded against the live gym; depth-best/final/trajectory all reported.

…topsy)

Autopsy of the "depth loses to breadth" result (.evolve/autopsies/2026-06-08): the comparison was asymmetric.

breadth = best-of-K (max over K independent shots, verifier-selected)
depth = FINAL state after K shots

So depth alone paid for late-shot degradation: a steer that makes the model re-touch the DB and undo correct work. Artifacts showed the signature: depth ending 0/N on tasks breadth solved (2/2->0, 5/7->0).

Fix: score the DB state after EVERY depth shot; report depth-BEST (max checkpoint, symmetric with breadth's best-of-K) alongside depth-FINAL. Checkpointing is deployable (snapshot the artifact, keep the best-verifying state).

Re-run (S0 generic, n=24, gpt-4.1): the -9.9pp REVERSES.

depth-FINAL - breadth	-0.1pp n.s. (the -9.9pp was noise + degradation)
depth-BEST - breadth	+6.0pp CI [-0.4, +13.1] score
depth-BEST - breadth	+12.5pp CI [0.0, +25.0] resolved
degradation = best - final = +6.2pp (steering reached better states, then undid them)

So within-run steering does NOT lose on EOPS: depth beats breadth even with a GENERIC steer, once scored fairly. The HumanEval gates used best-of-K on BOTH arms (no asymmetry), so those results are unaffected. depth-best is now the headline metric.

drewstone merged commit b81d6ef into main Jun 8, 2026
1 check passed

drewstone deleted the fix/eops-depth-best-scoring branch June 8, 2026 23:36

