feat(bench): EOPS depth-vs-breadth gate (tool-using router worker, live gym) #200
Merged
…ve gym)
The agentic, stateful counterpart to the HumanEval gates (where breadth won). The
worker is the tool-using router backend (routerToolLoop): it calls the EOPS gym's
live MCP tools, sees results, and acts — off-box (router inference + host→gym HTTP,
no sandbox). Both arms judged by the task's own SQL verifiers (deployable check):
  breadth@K: K INDEPENDENT shots on fresh seeded DBs; keep the best (resample)
  depth@K:   K SEQUENTIAL steered shots over ONE persistent DB; the artifact
             accumulates and each re-engagement finishes what's left (steering)
Equal compute (K×M turns). Partial-credit verifier score is the signal at this
difficulty (full resolution is sparse — tasks have 8-18 verifiers).
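The two arms can be sketched as follows. This is a minimal illustration of the definitions above, not the harness in this PR: the `Harness` interface and every function name are hypothetical, the calls are kept synchronous for brevity (real tool calls would be async), and how the steering prompt is built for each depth re-engagement is left out.

```typescript
// Hypothetical harness interface; these names are illustrative, not from the PR.
interface Harness<DB> {
  freshDb: () => DB;                           // seed a new database for the task
  runShot: (db: DB, maxTurns: number) => void; // one worker engagement, at most M turns
  verify: (db: DB) => boolean[];               // one pass/fail result per SQL verifier
}

// Partial-credit score: fraction of the task's verifiers that pass.
function verifierScore(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter((ok) => ok).length / results.length;
}

// breadth@K: K independent shots, each on its own fresh DB; keep the best score.
function breadthAtK<DB>(h: Harness<DB>, k: number, m: number): number {
  let best = 0;
  for (let shot = 0; shot < k; shot++) {
    const db = h.freshDb();
    h.runShot(db, m);
    best = Math.max(best, verifierScore(h.verify(db)));
  }
  return best;
}

// depth@K: K sequential shots over one persistent DB; score the final state.
function depthAtK<DB>(h: Harness<DB>, k: number, m: number): number {
  const db = h.freshDb();
  for (let shot = 0; shot < k; shot++) {
    h.runShot(db, m); // state written by earlier shots is still in db
  }
  return verifierScore(h.verify(db));
}
```

Both functions spend K shots of at most M turns each, so the comparison stays at equal compute; the only difference is whether state is discarded or carried between shots.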
Gym standup (now unblocked): gym_dbs.zip is at the root of
github.com/ServiceNow/EnterpriseOps-Gym (14.5MB, not LFS); unzip → EOPS_GYM_DBS_DIR;
itsm container `shivakrishnareddyma225/enterpriseops-gym-mcp-itsm` on :8005.
Directional (n=3 gpt-4.1): depth 44.4% vs breadth 27.8% score, +16.7pp — consistent
with the prior depth>breadth +13.4pp, and OPPOSITE the HumanEval breadth-wins result.
The boundary: breadth wins where samples are independent (stateless codegen); depth
wins where the artifact is stateful and accumulates (agentic ops).
…failures
Some gym MCP tools (e.g. map_change_request) ship inputSchemas with top-level
oneOf/anyOf, which the router's tool-calling rejects (must be a plain object at the
top level) — one such tool was aborting the whole run. sanitizeSchema coerces every
tool to a valid `{type:'object'}` head (nested combinators are fine; properties
kept). Per-task body is now try/caught: a failed task is excluded (logged), not
fatal, so one bad task can't kill an n=24 run.
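A minimal sketch of both fixes, under stated assumptions. `sanitizeSchemaSketch` and `runTasksSketch` are hypothetical names and do not reproduce the actual `sanitizeSchema`; in particular, dropping the top-level `oneOf`/`anyOf` outright is an assumption about how the coercion might work.

```typescript
type JsonSchema = { [key: string]: unknown };

function isPlainObject(value: unknown): value is JsonSchema {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Assumed behavior: force a plain {type:'object'} head, keep properties and
// other keywords, and drop only the TOP-LEVEL oneOf/anyOf. Nested schemas are
// copied by reference and left untouched.
function sanitizeSchemaSketch(schema: unknown): JsonSchema {
  const src = isPlainObject(schema) ? schema : {};
  const out: JsonSchema = {};
  for (const key of Object.keys(src)) {
    if (key === "oneOf" || key === "anyOf") continue;
    out[key] = src[key];
  }
  out["type"] = "object";
  if (!isPlainObject(out["properties"])) out["properties"] = {};
  return out;
}

interface RunSummary<T, R> {
  results: R[];
  excluded: { task: T; reason: string }[];
}

// Per-task try/catch: a failing task is logged and excluded, never fatal.
function runTasksSketch<T, R>(tasks: T[], runOne: (task: T) => R): RunSummary<T, R> {
  const summary: RunSummary<T, R> = { results: [], excluded: [] };
  for (const task of tasks) {
    try {
      summary.results.push(runOne(task));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      console.warn(`task excluded: ${reason}`);
      summary.excluded.push({ task, reason });
    }
  }
  return summary;
}
```

Recording the excluded tasks alongside the results keeps the effective n visible when scores are aggregated.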
What
The agentic, stateful counterpart to the HumanEval gates (where breadth/resampling won). The worker is the tool-using router backend (`routerToolLoop`, #199-class): it calls the EnterpriseOps-Gym live MCP tools, sees results, and acts, all off-box (router inference + host→gym HTTP, no sandbox). Both arms are judged by each task's own SQL verifiers (deployable check).
Equal compute (K×M turns). Partial-credit verifier score is the signal at this difficulty (tasks have 8–18 verifiers; full resolution is sparse).
Gym standup (was the #985 blocker — now resolved)
`gym_dbs.zip` is at the root of `github.com/ServiceNow/EnterpriseOps-Gym` (14.5MB, not LFS; the `ServiceNow-AI` HF org 404 was the red herring). Unzip → `EOPS_GYM_DBS_DIR`; itsm container `shivakrishnareddyma225/enterpriseops-gym-mcp-itsm` on `:8005`. Substrate verified: seed→200, 93 MCP tools, tool calls + SQL verifiers all green.
Result (directional)
n=3, gpt-4.1: depth 44.4% vs breadth 27.8% verifier score → +16.7pp (CI spans 0 at n=3). Consistent with the prior depth>breadth +13.4pp, and opposite the HumanEval breadth-wins result. The full n=24 run is in flight; I'll post the paired-bootstrap verdict here.
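For reference, a paired bootstrap over per-task score differences can be sketched as below. This is a generic percentile bootstrap, not necessarily the procedure used for the verdict: the function names, the iteration count, the seed, and the 95% interval are all assumptions, and no scores from this PR are embedded in it.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDiff: number; // observed mean of (depth - breadth) per task
  lo: number;       // 2.5th percentile of resampled means
  hi: number;       // 97.5th percentile of resampled means
}

// Paired bootstrap: resample TASKS with replacement, keeping each task's
// depth and breadth scores together by working on their difference.
function pairedBootstrap(
  depth: number[],
  breadth: number[],
  iters = 2000,
  seed = 1,
): BootstrapResult {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    let sum = 0;
    for (let j = 0; j < n; j++) {
      sum += diffs[Math.floor(rand() * n)];
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * (iters - 1))];
  const hi = means[Math.ceil(0.975 * (iters - 1))];
  const meanDiff = diffs.reduce((acc, d) => acc + d, 0) / n;
  return { meanDiff, lo, hi };
}
```

With only three tasks there are few distinct resamples, so a single task where breadth does better can pull the lower bound to or below zero, which is why n=3 is only directional.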
The boundary this draws
Breadth/resampling wins where samples are independent (stateless codegen — HumanEval ×2). Depth/steering wins where the artifact is stateful and accumulates (agentic ops — EOPS). That's the product thesis, now tested from both sides on deployable checkers.
Test
typecheck clean; smoke + n=3 ran end-to-end against the live gym (seed/tools/act/verify), worker confirmed acting (6–15 tool calls/task).