feat(bench): EOPS depth-vs-breadth gate (tool-using router worker, live gym) by drewstone · Pull Request #200 · tangle-network/agent-runtime

drewstone · 2026-06-08T22:46:59Z

What

The agentic, stateful counterpart to the HumanEval gates (where breadth/resampling won). The worker is the tool-using router backend (routerToolLoop, #199-class): it calls the EnterpriseOps-Gym live MCP tools, sees results, and acts, all off-box (router inference + host→gym HTTP, no sandbox). Both arms are judged by each task's own SQL verifiers (a deployable check).

breadth@K — K INDEPENDENT shots on fresh seeded DBs; keep the best (resample)
depth@K   — K SEQUENTIAL steered shots over ONE persistent DB; the artifact
            accumulates, each re-engagement finishes what's left (steering)

Equal compute: both arms get K×M turns (K shots of M turns each). The partial-credit verifier score is the signal at this difficulty, since tasks have 8–18 verifiers and full resolution is sparse.
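A minimal sketch of how the two arms differ, under stated assumptions: the `Gym` interface, `breadthAtK`, `depthAtK`, and the steering message are hypothetical names for illustration, not the PR's code. The real worker talks to the gym over HTTP and would be async; the sketch is synchronous to stay short. Scoring the final DB state for depth@K is also an assumption.

```typescript
// Hypothetical types; none of these names come from the PR.
type Score = number; // fraction of SQL verifiers passed, in [0, 1]

interface Gym<DB> {
  seed(): DB;                             // fresh seeded database
  runShot(db: DB, steer?: string): void;  // one worker shot of up to M turns
  verify(db: DB): Score;                  // partial-credit verifier score
}

// breadth@K: K independent shots, each on its own fresh DB; keep the best.
function breadthAtK<DB>(gym: Gym<DB>, k: number): Score {
  let best = 0;
  for (let i = 0; i < k; i++) {
    const db = gym.seed();
    gym.runShot(db);
    best = Math.max(best, gym.verify(db));
  }
  return best;
}

// depth@K: K sequential steered shots over ONE persistent DB.
// The DB is never re-seeded, so work from earlier shots accumulates.
function depthAtK<DB>(gym: Gym<DB>, k: number): Score {
  const db = gym.seed();
  let score = 0;
  for (let i = 0; i < k; i++) {
    // Assumption: steering is a short note asking the worker to finish what is left.
    const steer = i === 0 ? undefined : "Continue from the current state and finish what is left.";
    gym.runShot(db, steer);
    score = gym.verify(db);
  }
  return score;
}
```

Both functions spend K shots, so with M turns per shot the compute budget is the same K×M turns in either arm.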

Gym standup (was the #985 blocker — now resolved)

gym_dbs.zip is at the root of github.com/ServiceNow/EnterpriseOps-Gym (14.5MB, not LFS — the ServiceNow-AI HF org 404 was the red herring). Unzip → EOPS_GYM_DBS_DIR; itsm container shivakrishnareddyma225/enterpriseops-gym-mcp-itsm on :8005. Substrate verified: seed→200, 93 MCP tools, tool calls + SQL verifiers all green.

Result (directional)

n=3, gpt-4.1: depth 44.4% vs breadth 27.8% verifier score → +16.7pp (CI spans 0 at n=3). Consistent with the prior depth>breadth +13.4pp, and opposite the HumanEval breadth-wins result. The full n=24 run is in flight; I'll post the paired-bootstrap verdict here.
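For reference, a paired bootstrap over per-task score differences could look like the sketch below. This is an illustrative percentile bootstrap, not the PR's implementation: `pairedBootstrap`, `mulberry32`, the iteration count, and the 95% interval are all assumptions, and no per-task scores from the run are reproduced here.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDiff: number; // mean of (depth - breadth) over tasks
  lo: number;       // 2.5th percentile of resampled means
  hi: number;       // 97.5th percentile of resampled means
}

// depth[i] and breadth[i] must be scores for the SAME task i (paired design).
function pairedBootstrap(
  depth: number[],
  breadth: number[],
  iters = 10000,
  seed = 1,
): BootstrapResult {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    // Resample task-level differences with replacement.
    let sum = 0;
    for (let j = 0; j < n; j++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(iters - 1, Math.floor(q * iters))];
  const meanDiff = diffs.reduce((s, x) => s + x, 0) / n;
  return { meanDiff, lo: at(0.025), hi: at(0.975) };
}
```

If the interval `[lo, hi]` contains 0, the direction of the effect is not established at that sample size, which is the situation described for n=3.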

The boundary this draws

Breadth/resampling wins where samples are independent (stateless codegen — HumanEval ×2). Depth/steering wins where the artifact is stateful and accumulates (agentic ops — EOPS). That's the product thesis, now tested from both sides on deployable checkers.

Test

typecheck clean; smoke + n=3 ran end-to-end against the live gym (seed/tools/act/verify), worker confirmed acting (6–15 tool calls/task).

…ve gym)

The agentic, stateful counterpart to the HumanEval gates (where breadth won). The worker is the tool-using router backend (routerToolLoop): it calls the EOPS gym's live MCP tools, sees results, and acts, all off-box (router inference + host→gym HTTP, no sandbox). Both arms judged by the task's own SQL verifiers (deployable check):

breadth@K: K INDEPENDENT shots on fresh seeded DBs; keep the best (resample)
depth@K: K SEQUENTIAL steered shots over ONE persistent DB; the artifact accumulates and each re-engagement finishes what's left (steering)

Equal compute (K×M turns). Partial-credit verifier score is the signal at this difficulty (full resolution is sparse; tasks have 8-18 verifiers).

Gym standup (now unblocked): gym_dbs.zip is at the root of github.com/ServiceNow/EnterpriseOps-Gym (14.5MB, not LFS); unzip → EOPS_GYM_DBS_DIR; itsm container `shivakrishnareddyma225/enterpriseops-gym-mcp-itsm` on :8005.

Directional (n=3 gpt-4.1): depth 44.4% vs breadth 27.8% score, +16.7pp, consistent with the prior depth>breadth +13.4pp, and OPPOSITE the HumanEval breadth-wins result.

The boundary: breadth wins where samples are independent (stateless codegen); depth wins where the artifact is stateful and accumulates (agentic ops).

…failures

Some gym MCP tools (e.g. map_change_request) ship inputSchemas with a top-level oneOf/anyOf, which the router's tool-calling rejects (the top level must be a plain object); one such tool was aborting the whole run. sanitizeSchema coerces every tool schema to a valid `{type:'object'}` head (nested combinators are fine; properties are kept). The per-task body is now wrapped in try/catch: a failed task is excluded and logged rather than fatal, so one bad task can't kill an n=24 run.
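The schema coercion described above might be sketched as follows. This is a hypothetical re-implementation for illustration only: the PR's actual `sanitizeSchema` is not shown in the text, and the `JsonSchema` type and the exact handling of missing `properties` are assumptions.

```typescript
// Loose schema type for the sketch (assumption; the PR's types are not shown).
type JsonSchema = { [key: string]: unknown };

const TOP_LEVEL_COMBINATORS = new Set(["oneOf", "anyOf"]);

function isPlainObject(v: unknown): v is JsonSchema {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Hypothetical sanitizeSchema: force a plain-object head on a tool's inputSchema.
function sanitizeSchema(schema: unknown): JsonSchema {
  const src: JsonSchema = isPlainObject(schema) ? schema : {};
  const out: JsonSchema = {};
  for (const [key, value] of Object.entries(src)) {
    // Drop only TOP-LEVEL combinators; nested ones are left untouched.
    if (!TOP_LEVEL_COMBINATORS.has(key)) out[key] = value;
  }
  out.type = "object";
  // Keep existing properties; fall back to an empty map if none are usable.
  out.properties = isPlainObject(src.properties) ? src.properties : {};
  return out;
}
```

Dropping a top-level combinator loosens validation on the router side, so the gym's own tool handler remains the place where an invalid argument combination would be rejected.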

drewstone added 2 commits June 8, 2026 16:46

drewstone merged commit 6e8e6c0 into main Jun 8, 2026
1 check passed

drewstone deleted the feat/eops-depth-breadth-gate branch June 8, 2026 22:54

drewstone merged 2 commits into main from feat/eops-depth-breadth-gate
