feat(bench): EOPS depth-vs-breadth gate (tool-using router worker, live gym) #200

Merged
drewstone merged 2 commits into main from feat/eops-depth-breadth-gate
Jun 8, 2026

@drewstone

What

The agentic, stateful counterpart to the HumanEval gates (where breadth/resampling won). The worker is the tool-using router backend (routerToolLoop, #199-class): it calls the EnterpriseOps-Gym live MCP tools, sees results, and acts — off-box (router inference + host→gym HTTP, no sandbox). Both arms judged by each task's own SQL verifiers (deployable check).

breadth@K — K INDEPENDENT shots on fresh seeded DBs; keep the best (resample)
depth@K   — K SEQUENTIAL steered shots over ONE persistent DB; the artifact
            accumulates, each re-engagement finishes what's left (steering)

Equal compute (K×M turns). Partial-credit verifier score is the signal at this difficulty (tasks have 8–18 verifiers; full resolution is sparse).
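
The two arms can be sketched as below. This is a minimal illustration, not the code in this PR: the `Gym` interface, `seed`, `runShot`, `verify`, and the steering message are all hypothetical names, and returning the final score for the depth arm is an assumption about how the accumulated artifact is judged.

```typescript
// Hypothetical interface: none of these names come from the PR.
type Score = number; // fraction of SQL verifiers passed, in [0, 1]

interface Gym<DB> {
  seed(): Promise<DB>; // create a fresh seeded database
  runShot(db: DB, maxTurns: number, steer?: string): Promise<void>; // one engagement of up to M turns
  verify(db: DB): Promise<Score>; // partial-credit verifier score
}

// breadth@K: K independent shots, each on its own fresh DB; keep the best.
async function breadthAtK<DB>(gym: Gym<DB>, k: number, m: number): Promise<Score> {
  let best = 0;
  for (let i = 0; i < k; i++) {
    const db = await gym.seed();
    await gym.runShot(db, m);
    best = Math.max(best, await gym.verify(db));
  }
  return best;
}

// depth@K: K sequential shots over one persistent DB, so work accumulates.
async function depthAtK<DB>(gym: Gym<DB>, k: number, m: number): Promise<Score> {
  const db = await gym.seed();
  for (let i = 0; i < k; i++) {
    // Assumed steering text; the real prompt is not shown in the PR.
    const steer = i === 0 ? undefined : "Continue from the current state and finish what is left.";
    await gym.runShot(db, m, steer);
  }
  return gym.verify(db); // score of the accumulated artifact
}
```

Both functions spend the same K×M turn budget; the only difference is whether each shot starts from a fresh DB or from the state the previous shot left behind.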

Gym standup (was the #985 blocker — now resolved)

gym_dbs.zip is at the root of github.com/ServiceNow/EnterpriseOps-Gym (14.5MB, not LFS — the ServiceNow-AI HF org 404 was the red herring). Unzip → EOPS_GYM_DBS_DIR; itsm container shivakrishnareddyma225/enterpriseops-gym-mcp-itsm on :8005. Substrate verified: seed→200, 93 MCP tools, tool calls + SQL verifiers all green.

Result (directional)

n=3, gpt-4.1: depth 44.4% vs breadth 27.8% verifier score → +16.7pp (CI spans 0 at n=3). Consistent with the prior depth>breadth +13.4pp, and opposite the HumanEval breadth-wins result. The full n=24 run is in flight; I'll post the paired-bootstrap verdict here.
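
For reference, a paired-bootstrap interval over per-task score differences could look like the sketch below. It assumes a percentile interval on the mean of (depth − breadth) per task; the function names, the seeded PRNG, and the 95% level are illustrative choices, not necessarily what the bench harness does.

```typescript
// Small deterministic PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap on paired differences: resample tasks with replacement,
// keeping each task's depth and breadth scores together.
function pairedBootstrap(depth: number[], breadth: number[], iters = 10000, seed = 1) {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const rng = mulberry32(seed);
  const boot: number[] = [];
  for (let b = 0; b < iters; b++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += diffs[Math.floor(rng() * n)];
    boot.push(sum / n);
  }
  boot.sort((x, y) => x - y);
  const at = (q: number) => boot[Math.min(iters - 1, Math.floor(q * iters))];
  const meanDiff = diffs.reduce((x, y) => x + y, 0) / n;
  return { meanDiff, lo: at(0.025), hi: at(0.975) }; // 95% percentile interval
}
```

At n=3 there are only 27 possible ordered resamples, so the interval is coarse and easily spans 0; the verdict is declared only if the interval excludes 0.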

The boundary this draws

Breadth/resampling wins where samples are independent (stateless codegen — HumanEval ×2). Depth/steering wins where the artifact is stateful and accumulates (agentic ops — EOPS). That's the product thesis, now tested from both sides on deployable checkers.

Test

typecheck clean; smoke + n=3 ran end-to-end against the live gym (seed/tools/act/verify), worker confirmed acting (6–15 tool calls/task).

drewstone added 2 commits June 8, 2026 16:46
…ve gym)

The agentic, stateful counterpart to the HumanEval gates (where breadth won). The
worker is the tool-using router backend (routerToolLoop): it calls the EOPS gym's
live MCP tools, sees results, and acts — off-box (router inference + host→gym HTTP,
no sandbox). Both arms judged by the task's own SQL verifiers (deployable check):

  breadth@K — K INDEPENDENT shots on fresh seeded DBs; keep the best (resample)
  depth@K   — K SEQUENTIAL steered shots over ONE persistent DB; the artifact
              accumulates and each re-engagement finishes what's left (steering)

Equal compute (K×M turns). Partial-credit verifier score is the signal at this
difficulty (full resolution is sparse — tasks have 8-18 verifiers).

Gym standup (now unblocked): gym_dbs.zip is at the root of
github.com/ServiceNow/EnterpriseOps-Gym (14.5MB, not LFS); unzip → EOPS_GYM_DBS_DIR;
itsm container `shivakrishnareddyma225/enterpriseops-gym-mcp-itsm` on :8005.

Directional (n=3 gpt-4.1): depth 44.4% vs breadth 27.8% score, +16.7pp — consistent
with the prior depth>breadth +13.4pp, and OPPOSITE the HumanEval breadth-wins result.
The boundary: breadth wins where samples are independent (stateless codegen); depth
wins where the artifact is stateful and accumulates (agentic ops).
…failures

Some gym MCP tools (e.g. map_change_request) ship inputSchemas with top-level
oneOf/anyOf, which the router's tool-calling rejects (must be a plain object at the
top level) — one such tool was aborting the whole run. sanitizeSchema coerces every
tool to a valid `{type:'object'}` head (nested combinators are fine; properties
kept). Per-task body is now try/caught: a failed task is excluded (logged), not
fatal, so one bad task can't kill an n=24 run.
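
A rough sketch of both fixes, for readers without the diff open. Only the name `sanitizeSchema` and the described behavior come from the commit message; the bodies, `isPlainObject`, and `runTasks` are hypothetical reconstructions and may differ from the merged code.

```typescript
type JsonSchema = Record<string, unknown>;

function isPlainObject(v: unknown): v is JsonSchema {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Coerce a tool inputSchema to a plain object head: drop top-level oneOf/anyOf,
// force type "object", and keep properties (nested combinators are untouched).
function sanitizeSchema(schema: unknown): JsonSchema {
  const src: JsonSchema = isPlainObject(schema) ? schema : {};
  const out: JsonSchema = { ...src, type: "object" };
  delete out.oneOf;
  delete out.anyOf;
  if (!isPlainObject(out.properties)) out.properties = {};
  return out;
}

// Run every task; a task that throws is logged and excluded instead of aborting the run.
async function runTasks<T, R>(
  tasks: T[],
  runOne: (task: T) => Promise<R>,
): Promise<{ results: R[]; excluded: number }> {
  const results: R[] = [];
  let excluded = 0;
  for (const task of tasks) {
    try {
      results.push(await runOne(task));
    } catch (err) {
      excluded++;
      console.warn("task excluded:", err);
    }
  }
  return { results, excluded };
}
```

Reporting the excluded count alongside the results keeps the effective n visible when the paired comparison is computed.
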
@drewstone drewstone merged commit 6e8e6c0 into main Jun 8, 2026
1 check passed
@drewstone drewstone deleted the feat/eops-depth-breadth-gate branch June 8, 2026 22:54