feat(bench): HumanEval steering adapter + off-box runExperiment (gate: steering null) by drewstone · Pull Request #197 · tangle-network/agent-runtime

drewstone · 2026-06-08T16:08:18Z

What

Runs the observe→steer efficacy gate — the experiment humaneval-gate.mts itself names as "the next one": a real runLoop rollout that self-corrects across rounds, vs the gate's stateless selection. Ships the harness to run it on any adapter off-box.

- bench/src/benchmarks/humaneval.ts: HumanEval BenchmarkAdapter. The loader + Docker deployable verifier are moved here from humaneval-gate.mts (one home, no duplication; the gate now imports them). Judge runs the candidate against the task's own tests in an isolated --network=none container.
- bench/src/rsi.ts: BACKEND=router runs the worker off-box (router chat-completion as the leaf executor via inlineSandboxClient(createExecutor(...))): the real kernel + analyst steering, no sandbox dependency. Routes around the sandbox→router egress block (#984) and generalizes the canonical caller for every adapter. Arm labels aligned to corpus-report's random*/refine* contract so the paired-bootstrap + BH report auto-detects the contrasts.
- adapters.ts: registers humaneval.
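As a rough illustration of the isolated judge described above, the sketch below builds the argument list for a `docker run` call with networking disabled. It is not the adapter's actual code: the `VerifierOptions` shape, the `buildVerifierArgs` and `judgeFromExit` names, the image, the resource limits, and the `run_tests.py` entry point are all assumptions made for the example. Only `--network=none` comes from the description.

```typescript
// Hypothetical sketch: names, image, limits, and file layout are illustrative
// assumptions, not the real humaneval.ts implementation.
interface VerifierOptions {
  image: string;      // assumed image that ships a Python interpreter
  workDir: string;    // host directory holding the candidate and the task's tests
  timeoutSec: number; // wall-clock budget for one task
}

// Build the argv passed to the `docker` binary for one candidate run.
function buildVerifierArgs(opts: VerifierOptions): string[] {
  return [
    "run",
    "--rm",                // remove the container once it exits
    "--network=none",      // no network access for untrusted candidate code
    "--memory=512m",       // assumed memory cap
    "--pids-limit=128",    // assumed process cap
    "-v", `${opts.workDir}:/work:ro`, // mount the task directory read-only
    "-w", "/work",
    opts.image,
    "timeout", String(opts.timeoutSec), // coreutils timeout inside the container
    "python", "run_tests.py",           // assumed test entry point
  ];
}

// The judge verdict in this sketch: pass only when the test process exits 0.
function judgeFromExit(exitCode: number | null): boolean {
  return exitCode === 0;
}

const args = buildVerifierArgs({
  image: "python:3.11-slim",
  workDir: "/tmp/humaneval-task",
  timeoutSec: 10,
});
console.log(["docker", ...args].join(" "));
```

Keeping argument construction separate from process spawning makes the isolation flags easy to assert on without a Docker daemon.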

Gate result

gpt-3.5-turbo, HumanEval hard half (tasks 82–163), n=82, equal k=2, paired bootstrap B=10000:

| contrast | lift | 95% CI (pp) | verdict |
| --- | --- | --- | --- |
| more-compute (`random@2 − blind@1`) | +12.2pp | [+6.1, +19.5] | SIGNIFICANT |
| observe→steer (`refineAudit@2 − random@2`) | −1.2pp | [−8.5, +6.1] | n.s. |
| generic steer (`refinePush@2 − random@2`) | +1.2pp | [−4.9, +7.3] | n.s. |

At equal compute, observe→steer does NOT beat blind resampling on this deployable-checker domain; compute itself significantly does. A clean within-run-steering null — consistent with the FinSearchComp null and the selection≠RSI audit. The lone positive within-run signal remains depth>breadth on EOPS (agentic, multi-turn).
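For readers unfamiliar with the paired bootstrap behind these intervals, here is a minimal sketch of one common variant: a percentile bootstrap over per-task differences. It is not the code in corpus-report.mts, whose exact procedure is not shown here. The function names, the seeded PRNG, and the toy pass/fail vectors are all hypothetical.

```typescript
// Small deterministic PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  lift: number;         // mean per-task difference, arm A minus arm B
  lo: number;           // 2.5th percentile of the resampled means
  hi: number;           // 97.5th percentile of the resampled means
  significant: boolean; // true when the interval excludes zero
}

// armA[i] and armB[i] are the 0/1 pass outcomes of the SAME task under two arms.
function pairedBootstrap(armA: number[], armB: number[], B = 10000, seed = 1): BootstrapResult {
  if (armA.length !== armB.length || armA.length === 0) {
    throw new Error("arms must be paired and non-empty");
  }
  const n = armA.length;
  const diffs = armA.map((a, i) => a - armB[i]!);
  const lift = diffs.reduce((s, d) => s + d, 0) / n;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < B; b++) {
    // Resample tasks with replacement, keeping each task's pair together.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)]!;
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * (B - 1))]!;
  const hi = means[Math.ceil(0.975 * (B - 1))]!;
  return { lift, lo, hi, significant: lo > 0 || hi < 0 };
}

// Toy data only (hypothetical outcomes for 8 tasks, not results from this run).
const toyA = [1, 1, 0, 1, 1, 0, 1, 1];
const toyB = [1, 0, 0, 1, 0, 0, 1, 1];
console.log(pairedBootstrap(toyA, toyB, 2000, 42));
```

Resampling tasks rather than arms independently keeps per-task difficulty paired. With n=82 binary outcomes, each lift is a multiple of 1/82 ≈ 1.22pp, which is why the reported lifts and interval endpoints fall on that grid.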

Notes

- Band-opener constraint: kimi-k2 / glm-4.5-air are unserved off-box (router 404 / Cloudflare-block) and deepseek-chat saturates HumanEval (100% easy, 90.2% hard half) → ~0 correctable band. gpt-3.5-turbo on the hard half (54.9% pass@1) is the band. k=2 because at ~55% pass@1, random@4≈96% would erase the contrast.
- Reproduce: `BENCH=humaneval BACKEND=router WORKER_MODEL=gpt-3.5-turbo N=82 ROUNDS=2 OFFSET=82 tsx src/rsi.ts` → `tsx src/corpus-report.mts <corpus>`.
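The random@4≈96% figure in the band-opener note matches the usual back-of-envelope estimate 1 − (1 − p)^k, which assumes k independent attempts that each pass with the same probability p. The sketch below reproduces that arithmetic. `passAtK` is a hypothetical helper for illustration, not a function from the bench code.

```typescript
// Idealized best-of-k pass rate: assumes k independent samples, each passing
// with the same probability p on every task.
function passAtK(p: number, k: number): number {
  if (p < 0 || p > 1 || !Number.isInteger(k) || k < 1) {
    throw new RangeError("need 0 <= p <= 1 and integer k >= 1");
  }
  return 1 - Math.pow(1 - p, k);
}

const p1 = 0.549; // hard-half pass@1 quoted in the note above
for (const k of [1, 2, 4]) {
  console.log(`k=${k}: ${(100 * passAtK(p1, k)).toFixed(1)}%`);
}
// k=1: 54.9%
// k=2: 79.7%
// k=4: 95.9%
```

Under this idealization, k=4 leaves only about 4pp of headroom for any steering arm to improve on blind resampling, while k=2 leaves about 20pp. Because per-task pass probabilities differ in practice, the uniform-p formula overestimates the real random@k rate, so it should be read as a ceiling estimate rather than a prediction.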

Test

pnpm typecheck clean (bench tsconfig); the gate (refactored to import the moved primitives) ran live and produced the 54.9% band probe; the full A/B ran n=82 with 0 infra errors.

… result

Wires the observe→steer EFFICACY A/B (the experiment humaneval-gate.mts named as "the next one": a real runLoop rollout that self-corrects across rounds, vs the gate's stateless selection):

- benchmarks/humaneval.ts: HumanEval BenchmarkAdapter. Loader + Docker deployable verifier MOVED here from humaneval-gate.mts (one home, no duplication; the gate now imports them). judge = run candidate against the task's own tests in an isolated --network=none container.
- rsi.ts: BACKEND=router runs the worker OFF-BOX (router chat-completion as the leaf executor via inlineSandboxClient(createExecutor)): the real kernel + analyst steering with no sandbox dependency (sandbox→router egress is blocked, #984). Generalizes the canonical caller for every adapter, not just this run. Arm labels aligned to corpus-report's random*/refine* contract.

Gate result (gpt-3.5-turbo, HumanEval hard half tasks 82-163, n=82, equal k=2, paired bootstrap B=10000):

more-compute random@2 − blind@1 +12.2pp CI [+6.1, +19.5] SIGNIFICANT
observe→steer refineAudit@2 − random@2 −1.2pp CI [−8.5, +6.1] n.s.
generic-steer refinePush@2 − random@2 +1.2pp CI [−4.9, +7.3] n.s.

At equal compute, observe→steer does NOT beat blind resampling on this deployable-checker domain; compute itself significantly does. A clean null for within-run steering here, consistent with the FinSearchComp null and the selection≠RSI audit. Drew's cheap models can't open the band off-box (kimi/glm unserved via router; deepseek-chat saturates at ~90%), hence gpt-3.5-turbo.

drewstone merged commit e165640 into main Jun 8, 2026
1 check passed

drewstone deleted the feat/humaneval-steer-gate branch June 8, 2026 16:10

drewstone merged 1 commit into main from feat/humaneval-steer-gate
