feat(bench): HumanEval steering adapter + off-box runExperiment (gate: steering null) #197

Merged: drewstone merged 1 commit into main from feat/humaneval-steer-gate on Jun 8, 2026

@drewstone

Contributor

What

Runs the observe→steer efficacy gate — the experiment humaneval-gate.mts itself names as "the next one": a real runLoop rollout that self-corrects across rounds, vs the gate's stateless selection. Ships the harness to run it on any adapter off-box.

  • bench/src/benchmarks/humaneval.ts — HumanEval BenchmarkAdapter. The loader + Docker deployable verifier are moved here from humaneval-gate.mts (one home, no duplication — the gate now imports them). Judge runs the candidate against the task's own tests in an isolated --network=none container.
  • bench/src/rsi.ts: BACKEND=router runs the worker off-box (router chat-completion as the leaf executor via inlineSandboxClient(createExecutor(...))): the real kernel + analyst steering, no sandbox dependency. Routes around the sandbox→router egress block (#984) and generalizes the canonical caller for every adapter. Arm labels are aligned to corpus-report's random*/refine* contract so the paired-bootstrap + BH report auto-detects the contrasts.
  • adapters.ts — registers humaneval.
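
To make the judge step concrete, here is a minimal sketch of how a candidate might be assembled and handed to an isolated container. The helper names (`assembleProgram`, `dockerJudgeArgs`), the image, the memory cap, the mount layout, and the timeout are assumptions for illustration, not the code in humaneval.ts. Only the `--network=none` isolation and "run the candidate against the task's own tests" come from the description above. The task fields follow the public HumanEval dataset schema.

```typescript
// Fields from the public HumanEval dataset schema.
interface HumanEvalTask {
  task_id: string;
  prompt: string;      // function signature + docstring
  test: string;        // defines check(candidate)
  entry_point: string; // name of the function under test
}

// Hypothetical helper: concatenate prompt, model completion, and the task's own tests.
function assembleProgram(task: HumanEvalTask, completion: string): string {
  return `${task.prompt}${completion}\n\n${task.test}\n\ncheck(${task.entry_point})\n`;
}

// Hypothetical helper: argv for the `docker` CLI. Image, limits, and mount are assumed.
function dockerJudgeArgs(hostDir: string, script: string, timeoutSec = 10): string[] {
  return [
    "run", "--rm",
    "--network=none",            // no egress from the judged code
    "--memory=512m",
    "-v", `${hostDir}:/work:ro`, // read-only mount of the candidate directory
    "-w", "/work",
    "python:3.11-slim",
    "timeout", String(timeoutSec),
    "python", script,
  ];
}
```

A caller would write the assembled program into `hostDir`, spawn `docker` with these arguments, and treat exit code 0 as a pass and anything else (including a timeout) as a fail.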

Gate result

gpt-3.5-turbo, HumanEval hard half (tasks 82–163), n=82, equal k=2, paired bootstrap B=10000:

| contrast | lift | 95% CI | verdict |
|---|---|---|---|
| more-compute (random@2 − blind@1) | +12.2pp | [+6.1, +19.5] | SIGNIFICANT |
| observe→steer (refineAudit@2 − random@2) | −1.2pp | [−8.5, +6.1] | n.s. |
| generic steer (refinePush@2 − random@2) | +1.2pp | [−4.9, +7.3] | n.s. |

At equal compute, observe→steer does NOT beat blind resampling on this deployable-checker domain; compute itself significantly does. A clean within-run-steering null — consistent with the FinSearchComp null and the selection≠RSI audit. The lone positive within-run signal remains depth>breadth on EOPS (agentic, multi-turn).
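
For readers unfamiliar with the statistic behind these intervals, the following is a minimal sketch of a paired bootstrap over per-task outcomes. It is not corpus-report's implementation: the function names, the percentile interval, and the seeded PRNG are assumptions, and the BH correction the report applies across contrasts is omitted. Each arm is an array of 0/1 pass outcomes indexed by task, so resampling tasks keeps the pairing intact.

```typescript
// Small deterministic PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult { lift: number; lo: number; hi: number }

// Hypothetical helper: mean paired difference (armA - armB) with a percentile 95% CI.
function pairedBootstrap(armA: number[], armB: number[], B = 10000, seed = 1): BootstrapResult {
  if (armA.length !== armB.length || armA.length === 0) {
    throw new Error("arms must be paired and non-empty");
  }
  const n = armA.length;
  const diffs = armA.map((a, i) => a - armB[i]); // per-task difference
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < B; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)]; // resample tasks
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(B - 1, Math.floor(q * B))];
  const lift = diffs.reduce((s, d) => s + d, 0) / n;
  return { lift, lo: at(0.025), hi: at(0.975) };
}
```

Under this reading, a contrast is called significant when its interval excludes zero, which matches the first row of the table and neither of the steering rows.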

Notes

  • Band-opener constraint: kimi-k2 / glm-4.5-air are unserved off-box (router 404 / Cloudflare-block) and deepseek-chat saturates HumanEval (100% easy, 90.2% hard half) → ~0 correctable band. gpt-3.5-turbo on the hard half (54.9% pass@1) is the band. K=2 because at ~55% pass@1, random@4≈96% would erase the contrast.
  • Reproduce: BENCH=humaneval BACKEND=router WORKER_MODEL=gpt-3.5-turbo N=82 ROUNDS=2 OFFSET=82 tsx src/rsi.ts, then tsx src/corpus-report.mts <corpus>.
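
The k=2 choice in the band-opener note can be checked with a back-of-the-envelope calculation. The sketch below assumes every sample passes independently at the same rate, which is a simplification: per-task difficulty varies, so these figures are optimistic ceilings, not predictions. The function name is illustrative.

```typescript
// Chance that at least one of k independent samples passes, given per-sample rate p.
function passAtLeastOnce(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

const p = 0.549; // hard-half pass@1 from the band probe
console.log(passAtLeastOnce(p, 2).toFixed(3)); // 0.797
console.log(passAtLeastOnce(p, 4).toFixed(3)); // 0.959
```

At k=4 the blind-resampling baseline would already sit near the ceiling, leaving almost no room for a steering arm to show a lift. At k=2 there is still headroom.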

Test

pnpm typecheck clean (bench tsconfig); the gate (refactored to import the moved primitives) ran live and produced the 54.9% band probe; the full A/B ran n=82 with 0 infra errors.

… result

Wires the observe→steer EFFICACY A/B (the experiment humaneval-gate.mts named
as "the next one": a real runLoop rollout that self-corrects across rounds, vs
the gate's stateless selection):

- benchmarks/humaneval.ts: HumanEval BenchmarkAdapter — loader + Docker deployable
  verifier MOVED here from humaneval-gate.mts (one home, no duplication; the gate
  now imports them). judge = run candidate against the task's own tests in an
  isolated --network=none container.
- rsi.ts: BACKEND=router runs the worker OFF-BOX (router chat-completion as the
  leaf executor via inlineSandboxClient(createExecutor)) — the real kernel +
  analyst steering with no sandbox dependency (sandbox→router egress is blocked,
  #984). Generalizes the canonical caller for every adapter, not just this run.
  Arm labels aligned to corpus-report's random*/refine* contract.

Gate result (gpt-3.5-turbo, HumanEval hard half tasks 82-163, n=82, equal k=2,
paired bootstrap B=10000):
  more-compute   random@2 − blind@1       +12.2pp  CI [+6.1, +19.5]  SIGNIFICANT
  observe→steer  refineAudit@2 − random@2  -1.2pp  CI [-8.5, +6.1]   n.s.
  generic-steer  refinePush@2 − random@2   +1.2pp  CI [-4.9, +7.3]   n.s.

At equal compute, observe→steer does NOT beat blind resampling on this
deployable-checker domain; compute itself significantly does. A clean null for
within-run steering here — consistent with the FinSearchComp null and the
selection≠RSI audit. Drew's cheap models can't open the band off-box (kimi/glm
unserved via router; deepseek-chat saturates at ~90%), hence gpt-3.5-turbo.
@drewstone drewstone merged commit e165640 into main Jun 8, 2026
1 check passed
@drewstone drewstone deleted the feat/humaneval-steer-gate branch June 8, 2026 16:10