
Add AWM expert-in-the-loop benchmark recipe and honest eval assets #833

Merged
burtenshaw merged 1 commit into huggingface:main from thegovind:thegovind/feat/awm-expert-in-the-loop
Jun 22, 2026

@thegovind thegovind commented Jun 22, 2026

Collaborator

What this PR adds

This PR adds a self-contained Agent World Model (AWM) expert-in-the-loop benchmark recipe under examples/awm_expert_in_the_loop/.

AWM is a suite of synthetic tool-use environments where each task exposes MCP tools, database state, and a verifier. This example explores a training pattern: give the agent a virtual ask_expert tool backed by a frontier model, then measure whether that advice improves tool-use behavior.

This ports and updates the expert-in-the-loop idea from #428 onto the current agent_world_model_env. It is a clean follow-up rather than a branch merge, because #428 predates the hardened AWM environment that is already on main.

Design

The expert is client-side only.

  • ask_expert is never registered with the AWM server.
  • It never appears in list_tools().
  • The environment verifier and reward stay inside AWM.
  • Verifier code is read from the public Snowflake/AgentWorldModel-1K dataset, not from server internals.
  • The AWM server and OpenEnv core APIs are untouched.
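The client-side routing described in this list can be sketched roughly as follows. This is an illustration only, not the code in rollout.py: `FakeEnv`, `route_tool_call`, the tool names, and the `expert_fn` callback are hypothetical, and the real AWM client API may look different.

```python
from typing import Any, Callable, Dict, List

EXPERT_TOOL = "ask_expert"  # virtual tool name, handled only on the client


class FakeEnv:
    """Hypothetical stand-in for an AWM client; not the real API."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def list_tools(self) -> List[str]:
        # The server only advertises its own MCP tools (names invented here).
        return ["create_workflow", "list_workflows"]

    def call_tool(self, name: str, args: Dict[str, Any]) -> str:
        self.calls.append(name)
        return f"server handled {name}"


def route_tool_call(
    env: FakeEnv,
    expert_fn: Callable[[str], str],
    name: str,
    args: Dict[str, Any],
) -> str:
    """Answer ask_expert locally; forward every other tool call to the env."""
    if name == EXPERT_TOOL:
        # The expert call never reaches the server.
        return expert_fn(args.get("question", ""))
    return env.call_tool(name, args)


env = FakeEnv()
advice = route_tool_call(
    env, lambda q: f"advice for: {q}", EXPERT_TOOL, {"question": "next step?"}
)
result = route_tool_call(env, lambda q: "", "list_workflows", {})
```

Because the interception happens before anything is sent to the environment, the server's tool list and its call log never mention `ask_expert`, which is the separation the bullets above describe.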

The example includes:

| File | Purpose |
| --- | --- |
| rollout.py | One AWM episode from reset to verify |
| expert.py | Verifier-informed frontier advisor |
| run_benchmark.py | Baseline vs expert benchmark runner |
| build_split.py | Builds non-trivial task splits |
| splits/workflow_automation.json | Pinned 63 train / 33 validation split |
| assets/ | Research plots and result summary |
| EXPERT_ENHANCEMENT.md | Run notes and interpretation |

Result

The backend training run and the held-out evaluation both completed. On held-out verifier success, the expert and baseline conditions tie exactly: 1/33 each, for a paired delta of 0.0.
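As a worked check of that arithmetic, here is a small sketch. It assumes that "paired delta" means the mean per-task difference in verifier success (expert minus baseline); the PR does not spell out the definition. The per-task outcome vectors are placeholders, since the PR reports only the 1/33 totals and not which task each condition solved.

```python
from fractions import Fraction
from typing import Sequence


def paired_delta(expert: Sequence[bool], baseline: Sequence[bool]) -> Fraction:
    """Mean of per-task (expert - baseline) success differences."""
    if len(expert) != len(baseline) or not expert:
        raise ValueError("need two non-empty outcome lists of equal length")
    diffs = [int(e) - int(b) for e, b in zip(expert, baseline)]
    return Fraction(sum(diffs), len(diffs))


n_tasks = 33
# Placeholder outcomes: the solved task indices are assumptions, not PR data.
baseline = [i == 0 for i in range(n_tasks)]
expert = [i == 5 for i in range(n_tasks)]

delta = paired_delta(expert, baseline)
success_rate = Fraction(sum(expert), n_tasks)
```

Under this definition the delta equals the difference of the two success rates, so it is 0 whenever both conditions solve the same number of tasks, whether or not they solve the same task. A success rate of 1/33 is roughly 3%.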

[Figure: Held-out AWM validation]

The shaped training reward is useful diagnostic context, but it is not the success metric. In this run, the no-expert condition scored higher on shaped reward (0.140 vs. 0.046), mostly because it took more tool turns.

[Figure: Training rollout diagnostics]

| Condition | Train steps | Mean shaped train reward | Max train batch completion | Held-out complete rate |
| --- | --- | --- | --- | --- |
| No expert | 31 | 0.140 | 0.25 | 1/33 |
| Expert available | 31 | 0.046 | 0.25 | 1/33 |

The honest conclusion is:

Foundry Fine-Tuning backend training and held-out eval both ran successfully. This configuration does not yet show expert-in-the-loop improvement.

Why this is useful

This PR gives reviewers and future contributors a concrete, reproducible harness for experimenting with AWM expert curricula:

  • honest non-trivial task split
  • no-op verifier-pass filtering
  • client-side expert orchestration
  • verifier-success reporting
  • public plots and machine-readable result summary
  • no changes to the AWM server or OpenEnv core APIs
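The no-op verifier-pass filtering in the list above can be illustrated with a minimal sketch. It assumes a task counts as trivial when its verifier already passes on the untouched initial state, which is one reading of "no-op verifier pass". `filter_trivial_tasks`, the `noop_passes` field, and the sample task ids are hypothetical and are not taken from build_split.py.

```python
from typing import Callable, Dict, List

Task = Dict[str, object]


def filter_trivial_tasks(
    tasks: List[Task],
    passes_without_actions: Callable[[Task], bool],
) -> List[Task]:
    """Keep only tasks whose verifier fails when the agent does nothing."""
    return [task for task in tasks if not passes_without_actions(task)]


# Hypothetical tasks; in a real run the flag would come from actually
# executing each verifier against the freshly reset environment state.
tasks: List[Task] = [
    {"id": "task_a", "noop_passes": True},
    {"id": "task_b", "noop_passes": False},
    {"id": "task_c", "noop_passes": False},
]

kept = filter_trivial_tasks(tasks, lambda task: bool(task["noop_passes"]))
kept_ids = [task["id"] for task in kept]
```

Dropping tasks that pass with no actions keeps a do-nothing policy from being credited with a success, which is what makes the reported held-out rate meaningful.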

Attribution

This builds on the expert-in-the-loop track from #428. Original work is credited to @sfc-gh-mhidayetoglu and @sfc-gh-kganesan through commit co-author trailers and PR attribution.

Validation

Passing checks:

python3 -m py_compile examples/awm_expert_in_the_loop/*.py
uv run usort check examples/awm_expert_in_the_loop/
uv run ruff check examples/awm_expert_in_the_loop/
uv run ruff format --check examples/awm_expert_in_the_loop/
PYTHONPATH=src:envs:examples/awm_expert_in_the_loop uv run pytest examples/awm_expert_in_the_loop/test_rollout.py -q
PYTHONPATH=src:envs uv run pytest tests/ -q --tb=short

Observed locally:

  • Example package tests: 16 passed
  • Full test suite: 1309 passed, 128 skipped, 1 warning

Known baseline issue: the repository lint hook reports pre-existing formatting differences in unrelated envs/* files. Targeted lint and format checks for this example pass, and PR CI lint passes.

@bot-ci-comment


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@thegovind thegovind changed the title Add AWM expert-in-the-loop benchmark recipe Add AWM expert-in-the-loop benchmark recipe and honest eval assets Jun 22, 2026
Add a self-contained Agent World Model expert-in-the-loop recipe under
examples/awm_expert_in_the_loop. The expert is a client-side virtual ask_expert
tool, never registered with the AWM server or exposed through list_tools().
Verifier code is read from the public Snowflake/AgentWorldModel-1K dataset, so
client-server separation is preserved and the hardened AWM server remains
untouched.

The example includes:
- reusable rollout, expert, policy, and data helpers
- baseline vs expert benchmark runner with pinned split support
- honest workflow_automation split with no-op verifier passes filtered out
- pure unit tests for parser and LLM helper behavior
- public-safe research plots and run notes from completed Foundry Fine-Tuning
  training and held-out evaluation

The completed backend run is reported conservatively: both expert and no-expert
final checkpoints solve 1/33 held-out validation tasks, so this does not claim
an expert improvement yet.

Refs huggingface#428

Co-authored-by: Mert Hidayetoglu <178332432+sfc-gh-mhidayetoglu@users.noreply.github.com>
Co-authored-by: Karthik Ganesan <243232830+sfc-gh-kganesan@users.noreply.github.com>
@thegovind thegovind force-pushed the thegovind/feat/awm-expert-in-the-loop branch 2 times, most recently from 9c5dd38 to b632bd7 Compare June 22, 2026 02:32

@burtenshaw burtenshaw left a comment

Collaborator


Nice work. And thanks for taking this across the line.

@burtenshaw burtenshaw merged commit ee5cb56 into huggingface:main Jun 22, 2026
14 checks passed