fix(bench): EOPS depth scored on best checkpoint, not final state (autopsy reverses the result) #202
Merged
…topsy)

Autopsy of the "depth loses to breadth" result (.evolve/autopsies/2026-06-08): the comparison was asymmetric. breadth = best-of-K (max over K independent shots, verifier-selected); depth = FINAL state after K shots. So depth alone paid for late-shot degradation: a steer that makes the model re-touch the DB and undo correct work. Artifacts showed the signature: depth ending 0/N on tasks breadth solved (2/2->0, 5/7->0).

Fix: score the DB state after EVERY depth shot; report depth-BEST (max checkpoint, symmetric with breadth's best-of-K) alongside depth-FINAL. Checkpointing is deployable (snapshot the artifact, keep the best-verifying state).

Re-run (S0 generic, n=24, gpt-4.1): the -9.9pp REVERSES.
  depth-FINAL - breadth   -0.1pp  n.s. (the -9.9pp was noise + degradation)
  depth-BEST  - breadth   +6.0pp  CI [-0.4,+13.1]  score
  depth-BEST  - breadth  +12.5pp  CI [ 0.0,+25.0]  resolved
  degradation = best - final = +6.2pp (steering reached better states, then undid them)

So within-run steering does NOT lose on EOPS: depth beats breadth even with a GENERIC steer, once scored fairly. The HumanEval gates used best-of-K on BOTH arms (no asymmetry) so those results are unaffected. depth-best is now the headline metric.
Autopsy finding
/autopsy of the "depth loses to breadth on EOPS" result (.evolve/autopsies/2026-06-08-eops-depth-breadth.md). Root cause: design-flaw in my own harness. The comparison was asymmetric:
- breadth = best-of-K (max over K independent shots, verifier-selected)
- depth = final state after K sequential shots

So depth alone paid for late-shot degradation: a steer that makes the model re-touch the DB and undo correct work. Ground-truth artifacts showed the signature: depth ending 0/N on tasks breadth solved (2/2→0, 5/7→0).
Fix
Score the DB state after every depth shot; report depth-BEST (max checkpoint, symmetric with breadth's best-of-K) alongside depth-FINAL. Checkpointing is deployable — snapshot the artifact, keep the best-verifying state (exactly what breadth does across its K).
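The scoring rule described above can be sketched roughly as follows. This is a minimal illustration, not the harness code: `ArmScores`, `score_task`, and the example numbers are hypothetical, and it assumes each shot or checkpoint has already been reduced to a single verifier score in [0, 1].

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ArmScores:
    """Per-task scores for both arms (hypothetical container)."""
    breadth_best: float  # max over K independent shots
    depth_best: float    # max over K sequential checkpoints
    depth_final: float   # score of the state left after the last shot

    @property
    def degradation(self) -> float:
        """Score that steering reached and then undid (best - final)."""
        return self.depth_best - self.depth_final


def score_task(breadth_shots: Sequence[float],
               depth_checkpoints: Sequence[float]) -> ArmScores:
    """Score one task symmetrically: best-of-K on both arms, plus depth-FINAL."""
    if not breadth_shots or not depth_checkpoints:
        raise ValueError("each arm needs at least one scored shot")
    return ArmScores(
        breadth_best=max(breadth_shots),
        depth_best=max(depth_checkpoints),
        depth_final=depth_checkpoints[-1],
    )


# Illustrative numbers only (not taken from the run): depth reaches 5/7,
# then a late shot undoes the work and the final state scores 0.
example = score_task(
    breadth_shots=[0.0, 5 / 7, 2 / 7],
    depth_checkpoints=[2 / 7, 5 / 7, 0.0],
)
```

Under final-state scoring this example task would count as a depth loss; under best-checkpoint scoring the two arms tie, and the gap shows up as `degradation` instead.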
The result reverses (S0 generic, n=24, gpt-4.1, paired bootstrap)
Within-run steering does NOT lose on EOPS. Depth beats breadth even with a generic steer, once scored fairly. The earlier "steering loses everywhere" conclusion was a scoring artifact and should not drive strategy.
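For readers unfamiliar with the paired bootstrap named in the heading, here is one way such an interval could be computed. The PR does not state the exact procedure, so everything here is an assumption: a percentile interval, resampling tasks with replacement, 10,000 resamples, and the hypothetical function name `paired_bootstrap_ci`.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    arm_a: Sequence[float],
    arm_b: Sequence[float],
    n_resamples: int = 10_000,  # assumed; not stated in the PR
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for arm_a - arm_b.

    Scores are paired by task: index i in both sequences is the same task.
    """
    if len(arm_a) != len(arm_b) or not arm_a:
        raise ValueError("arms must be non-empty and the same length")

    # Work on per-task differences so the pairing is preserved.
    diffs = [a - b for a, b in zip(arm_a, arm_b)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample tasks with replacement and record the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Percentile interval over the resampled means.
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, low, high
```

Calling it with per-task depth-BEST scores as `arm_a` and breadth scores as `arm_b` would give a difference and interval of the same kind as those reported for this PR; an interval that includes 0 means the sign of the difference is not settled at that sample size.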
The HumanEval gates used best-of-K on both arms (no asymmetry), so those results stand — the boundary is intact: breadth wins on stateless codegen, depth(+checkpoint) wins on stateful agentic ops. Two independent engineering takeaways: (1) always checkpoint the artifact and keep the best-verifying state — final-state scoring undersells steering; (2) steering can actively degrade, so the keep-best policy is load-bearing, not cosmetic.
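The keep-best policy in takeaway (1) might look like the sketch below. It is an assumption-laden illustration rather than the deployed code: `run_with_keep_best`, the dict-as-database state, and the toy verifier are all hypothetical, and the initial state is treated as checkpoint zero, which is a design choice the PR does not specify.

```python
import copy
from typing import Callable, Iterable, List, Tuple, TypeVar

State = TypeVar("State")


def run_with_keep_best(
    initial: State,
    steps: Iterable[Callable[[State], State]],
    verify: Callable[[State], float],
) -> Tuple[State, float, List[float]]:
    """Apply steered shots in sequence, snapshotting the best-verifying state."""
    state = initial
    best_state = copy.deepcopy(initial)
    best_score = verify(initial)
    trajectory = [best_score]  # score at checkpoint zero, then after each shot

    for step in steps:
        state = step(state)
        score = verify(state)
        trajectory.append(score)
        # Strict '>' keeps the earliest state that reached the best score.
        if score > best_score:
            best_state = copy.deepcopy(state)
            best_score = score

    return best_state, best_score, trajectory


# Toy example: the "DB" is a dict and the verifier counts matching rows.
target = {"a": 1, "b": 2}


def verify(db: dict) -> float:
    return sum(db.get(k) == v for k, v in target.items()) / len(target)


steps = [
    lambda db: {**db, "a": 1},  # shot 1: one row correct
    lambda db: {**db, "b": 2},  # shot 2: both rows correct
    lambda db: {},              # shot 3: late shot wipes the correct work
]
best_state, best_score, trajectory = run_with_keep_best({}, steps, verify)
```

Here the final state scores 0 while the snapshot scores 1, which is the degradation pattern the autopsy describes: without the snapshot, the late shot's damage is what gets shipped and scored.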
Test
Typecheck clean. The instrumented n=24 run completed against the live gym with 0 tasks excluded; depth-best, depth-final, and the per-shot trajectory are all reported.