Add AWM expert-in-the-loop benchmark recipe and honest eval assets by thegovind · Pull Request #833 · huggingface/OpenEnv

thegovind · 2026-06-22T01:42:16Z

What this PR adds

This PR adds a self-contained Agent World Model (AWM) expert-in-the-loop benchmark recipe under examples/awm_expert_in_the_loop/.

AWM is a suite of synthetic tool-use environments where each task exposes MCP tools, database state, and a verifier. This example explores a training pattern: give the agent a virtual ask_expert tool backed by a frontier model, then measure whether that advice improves tool-use behavior.

This ports and updates the expert-in-the-loop idea from #428 onto the current agent_world_model_env. It is a clean follow-up rather than a branch merge, because #428 predates the hardened AWM environment that is already on main.

Design

The expert is client-side only.

- `ask_expert` is never registered with the AWM server.
- It never appears in `list_tools()`.
- The environment verifier and reward stay inside AWM.
- Verifier code is read from the public Snowflake/AgentWorldModel-1K dataset, not from server internals.
- The AWM server and OpenEnv core APIs are untouched.
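
The client-side separation described above can be sketched roughly as follows. This is a minimal illustration, not the code in `rollout.py` or `expert.py`: the `ClientSideRouter` class, the `call_env_tool` and `ask_expert` callables, and the example tool names are all hypothetical stand-ins. Only the `ask_expert` tool name and the rule that it never reaches the server come from this PR.

```python
from dataclasses import dataclass
from typing import Any, Callable

EXPERT_TOOL = "ask_expert"  # tool name from the PR; everything else is hypothetical


@dataclass
class ClientSideRouter:
    """Keeps ask_expert on the client and forwards every other call to the env."""

    call_env_tool: Callable[[str, dict[str, Any]], Any]  # stand-in for the env client
    ask_expert: Callable[[str], str]  # stand-in for a frontier-model call
    env_tool_names: list[str]  # what the server itself would list
    expert_calls: int = 0

    def visible_tools(self, expert_enabled: bool) -> list[str]:
        # The agent sees the server's tools plus, optionally, the virtual tool.
        extra = [EXPERT_TOOL] if expert_enabled else []
        return list(self.env_tool_names) + extra

    def dispatch(self, name: str, arguments: dict[str, Any]) -> Any:
        if name == EXPERT_TOOL:
            # Answered locally; the server never receives this call.
            self.expert_calls += 1
            return self.ask_expert(arguments.get("question", ""))
        return self.call_env_tool(name, arguments)


# Fake environment that records which tool calls actually reach the "server".
server_log: list[str] = []


def fake_env(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    server_log.append(name)
    return {"ok": True}


router = ClientSideRouter(
    call_env_tool=fake_env,
    ask_expert=lambda question: f"advice for: {question}",
    env_tool_names=["create_workflow", "list_workflows"],  # hypothetical tool names
)

advice = router.dispatch("ask_expert", {"question": "Which tool first?"})
result = router.dispatch("list_workflows", {})
```

In this sketch the server-side log only ever contains real environment tools, which is the property the design list relies on: the verifier and reward never see the expert call.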

The example includes:

| File | Purpose |
| --- | --- |
| `rollout.py` | One AWM episode from reset to verify |
| `expert.py` | Verifier-informed frontier advisor |
| `run_benchmark.py` | Baseline vs expert benchmark runner |
| `build_split.py` | Builds non-trivial task splits |
| `splits/workflow_automation.json` | Pinned 63 train / 33 validation split |
| `assets/` | Research plots and result summary |
| `EXPERT_ENHANCEMENT.md` | Run notes and interpretation |

Result

The backend run and held-out evaluation succeeded. On held-out verifier success, the expert and baseline tie exactly: 1/33 each, paired delta 0.0.

Shaped training reward is useful diagnostic context, but it is not the success metric. In this run the no-expert condition scored higher on mean shaped reward (0.140 vs 0.046), mostly because it took more tool turns.

| Condition | Train steps | Mean shaped train reward | Max train batch completion | Held-out complete rate |
| --- | --- | --- | --- | --- |
| No expert | 31 | 0.140 | 0.25 | 1/33 |
| Expert available | 31 | 0.046 | 0.25 | 1/33 |
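
For readers checking the arithmetic, the held-out complete rate and paired delta can be computed as in the sketch below. The function names are hypothetical, and the per-task outcome vectors are placeholders: the PR reports only the 1/33 totals, not which task each condition solved. The mean paired difference equals the difference of the two rates, so it is 0.0 whichever task each condition solved.

```python
N_HELD_OUT = 33  # validation split size reported in this PR


def complete_rate(outcomes: list[int]) -> float:
    """Fraction of held-out tasks whose verifier reported success (1) vs failure (0)."""
    return sum(outcomes) / len(outcomes)


def paired_delta(expert: list[int], baseline: list[int]) -> float:
    """Mean per-task difference (expert minus baseline) over the same task list."""
    if len(expert) != len(baseline):
        raise ValueError("paired comparison needs outcomes for the same tasks")
    diffs = [e - b for e, b in zip(expert, baseline)]
    return sum(diffs) / len(diffs)


# Placeholder outcomes: exactly one success per condition, at arbitrary indices.
baseline_outcomes = [0] * N_HELD_OUT
baseline_outcomes[4] = 1  # arbitrary index, not taken from the run
expert_outcomes = [0] * N_HELD_OUT
expert_outcomes[11] = 1  # arbitrary index, not taken from the run

baseline_rate = complete_rate(baseline_outcomes)
expert_rate = complete_rate(expert_outcomes)
delta = paired_delta(expert_outcomes, baseline_outcomes)
```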

The honest conclusion is:

Foundry Fine-Tuning backend training and held-out eval both ran successfully. This configuration does not yet show expert-in-the-loop improvement.

Why this is useful

This PR gives reviewers and future contributors a concrete, reproducible harness for experimenting with AWM expert curricula:

- honest non-trivial task split
- no-op verifier-pass filtering
- client-side expert orchestration
- verifier-success reporting
- public plots and machine-readable result summary
- no changes to the AWM server or OpenEnv core APIs
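
One way to read "no-op verifier-pass filtering" is: drop any task whose verifier already passes when the agent takes no actions, then pin a deterministic split of what remains. The sketch below illustrates that reading only. It is not the logic in `build_split.py`; the function names, the seeded shuffle, and the toy task IDs are all assumptions.

```python
import random
from typing import Callable, Iterable


def filter_noop_passes(
    task_ids: Iterable[str],
    passes_without_actions: Callable[[str], bool],
) -> list[str]:
    """Keep only tasks that an agent cannot 'solve' by doing nothing."""
    return [t for t in task_ids if not passes_without_actions(t)]


def pinned_split(task_ids: Iterable[str], n_val: int, seed: int = 0) -> dict[str, list[str]]:
    """Deterministic train/validation split: sort first, then a seeded shuffle."""
    ordered = sorted(task_ids)
    random.Random(seed).shuffle(ordered)
    return {
        "train": sorted(ordered[n_val:]),
        "validation": sorted(ordered[:n_val]),
    }


# Toy data: eight hypothetical task IDs, two of which pass trivially.
all_tasks = [f"task_{i}" for i in range(8)]
trivial = {"task_2", "task_5"}

kept = filter_noop_passes(all_tasks, lambda task_id: task_id in trivial)
split = pinned_split(kept, n_val=2)
```

Sorting before the seeded shuffle makes the split independent of the order in which tasks are listed, so the resulting JSON can be committed and reused across runs.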

Attribution

This builds on the expert-in-the-loop track from #428. Original work is credited to @sfc-gh-mhidayetoglu and @sfc-gh-kganesan through commit co-author trailers and PR attribution.

Validation

Passing checks:

python3 -m py_compile examples/awm_expert_in_the_loop/*.py
uv run usort check examples/awm_expert_in_the_loop/
uv run ruff check examples/awm_expert_in_the_loop/
uv run ruff format --check examples/awm_expert_in_the_loop/
PYTHONPATH=src:envs:examples/awm_expert_in_the_loop uv run pytest examples/awm_expert_in_the_loop/test_rollout.py -q
PYTHONPATH=src:envs uv run pytest tests/ -q --tb=short

Observed locally:

Example package tests: 16 passed
Full test suite: 1309 passed, 128 skipped, 1 warning

Known baseline issue: the repository lint hook reports pre-existing formatting differences in unrelated envs/* files. Targeted lint and format checks for this example pass, and PR CI lint passes.

bot-ci-comment · 2026-06-22T01:44:07Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Add a self-contained Agent World Model expert-in-the-loop recipe under examples/awm_expert_in_the_loop.

The expert is a client-side virtual ask_expert tool, never registered with the AWM server or exposed through list_tools(). Verifier code is read from the public Snowflake/AgentWorldModel-1K dataset, so client-server separation is preserved and the hardened AWM server remains untouched.

The example includes:

- reusable rollout, expert, policy, and data helpers
- baseline vs expert benchmark runner with pinned split support
- honest workflow_automation split with no-op verifier passes filtered out
- pure unit tests for parser and LLM helper behavior
- public-safe research plots and run notes from completed Foundry Fine-Tuning training and held-out evaluation

The completed backend run is reported conservatively: both expert and no-expert final checkpoints solve 1/33 held-out validation tasks, so this does not claim an expert improvement yet.

Refs huggingface#428

Co-authored-by: Mert Hidayetoglu <178332432+sfc-gh-mhidayetoglu@users.noreply.github.com>
Co-authored-by: Karthik Ganesan <243232830+sfc-gh-kganesan@users.noreply.github.com>

burtenshaw

Nice work. And thanks for taking this across the line.

thegovind mentioned this pull request Jun 22, 2026

OpenEnv hackathon expert-in-the-loop track w/ AWM #428

Closed

12 tasks

thegovind changed the title ~~Add AWM expert-in-the-loop benchmark recipe~~ Add AWM expert-in-the-loop benchmark recipe and honest eval assets Jun 22, 2026

thegovind force-pushed the thegovind/feat/awm-expert-in-the-loop branch 2 times, most recently from 9c5dd38 to b632bd7 Compare June 22, 2026 02:32

burtenshaw approved these changes Jun 22, 2026

burtenshaw merged commit ee5cb56 into huggingface:main Jun 22, 2026
14 checks passed
