Add AWM expert-in-the-loop benchmark recipe and honest eval assets #833
Merged
burtenshaw merged 1 commit on Jun 22, 2026
Add a self-contained Agent World Model expert-in-the-loop recipe under examples/awm_expert_in_the_loop.

The expert is a client-side virtual ask_expert tool, never registered with the AWM server or exposed through list_tools(). Verifier code is read from the public Snowflake/AgentWorldModel-1K dataset, so client-server separation is preserved and the hardened AWM server remains untouched.

The example includes:
- reusable rollout, expert, policy, and data helpers
- baseline vs expert benchmark runner with pinned split support
- honest workflow_automation split with no-op verifier passes filtered out
- pure unit tests for parser and LLM helper behavior
- public-safe research plots and run notes from completed Foundry Fine-Tuning training and held-out evaluation

The completed backend run is reported conservatively: both expert and no-expert final checkpoints solve 1/33 held-out validation tasks, so this does not claim an expert improvement yet.

Refs huggingface#428

Co-authored-by: Mert Hidayetoglu <178332432+sfc-gh-mhidayetoglu@users.noreply.github.com>
Co-authored-by: Karthik Ganesan <243232830+sfc-gh-kganesan@users.noreply.github.com>
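To make the client-side idea concrete, here is a minimal sketch of how a virtual tool can be intercepted before a call reaches the server. It is not the code in this PR: `ClientSideToolRouter`, `fake_server_call`, `fake_expert`, and the `create_ticket` tool name are hypothetical stand-ins, and the real helpers in the example may be structured differently.

```python
from typing import Any, Callable, Dict, List

# Name of the virtual tool; it exists only on the client in this sketch.
VIRTUAL_TOOL = "ask_expert"


class ClientSideToolRouter:
    """Hypothetical router: ask_expert stays local, all other calls go to the server."""

    def __init__(
        self,
        server_call: Callable[[str, Dict[str, Any]], str],
        server_tools: List[str],
        expert: Callable[[str], str],
    ) -> None:
        self._server_call = server_call
        self._server_tools = list(server_tools)
        self._expert = expert

    def tools_for_policy(self) -> List[str]:
        # The policy sees the server's tools plus the virtual tool.
        # The server's own tool list is never modified.
        return self._server_tools + [VIRTUAL_TOOL]

    def call(self, name: str, arguments: Dict[str, Any]) -> str:
        if name == VIRTUAL_TOOL:
            # Answered on the client; the server never sees this call.
            return self._expert(str(arguments.get("question", "")))
        return self._server_call(name, arguments)


# Stand-ins for a real environment client and a real expert model (assumptions).
server_log: List[str] = []


def fake_server_call(name: str, arguments: Dict[str, Any]) -> str:
    server_log.append(name)
    return f"server ran {name}"


def fake_expert(question: str) -> str:
    return f"advice for: {question}"


router = ClientSideToolRouter(fake_server_call, ["create_ticket"], fake_expert)
router.call("ask_expert", {"question": "Which tool should I call first?"})
router.call("create_ticket", {"title": "demo"})
```

After these two calls, `server_log` contains only `create_ticket`, which is the separation property the commit message describes: the server has no record that an expert exists.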
9c5dd38 to b632bd7
burtenshaw (Collaborator) approved these changes on Jun 22, 2026 and left a comment:
Nice work. And thanks for taking this across the line.
What this PR adds
This PR adds a self-contained Agent World Model (AWM) expert-in-the-loop benchmark recipe under examples/awm_expert_in_the_loop/.
AWM is a suite of synthetic tool-use environments where each task exposes MCP tools, database state, and a verifier. This example explores a training pattern: give the agent a virtual ask_expert tool backed by a frontier model, then measure whether that advice improves tool-use behavior.
This ports and updates the expert-in-the-loop idea from #428 onto the current agent_world_model_env. It is a clean follow-up rather than a branch merge, because #428 predates the hardened AWM environment that is already on main.
Design
The expert is client-side only.
- ask_expert is never registered with the AWM server.
- It is never exposed through list_tools().
- Verifier code is read from the public Snowflake/AgentWorldModel-1K dataset, not from server internals.
The example includes:
- rollout.py
- expert.py
- run_benchmark.py
- build_split.py
- splits/workflow_automation.json
- assets/EXPERT_ENHANCEMENT.md
Result
The backend run and held-out evaluation succeeded. On held-out verifier success, the expert and baseline tie exactly: 1/33 each, paired delta 0.0.
Training shaped reward is useful diagnostic context, but it is not the success metric. In this run the no-expert condition scored higher on shaped reward mostly because it took more tool turns.
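For readers unfamiliar with the paired comparison, the sketch below shows one common way to compute a paired delta on verifier success: the mean per-task difference between the two conditions over the same task IDs. This definition is an assumption, since the PR does not spell out its formula, and the task IDs and outcomes are illustrative only, not results from this run.

```python
from typing import Mapping


def paired_delta(expert: Mapping[str, bool], baseline: Mapping[str, bool]) -> float:
    """Mean per-task difference in verifier success (expert minus baseline)."""
    if expert.keys() != baseline.keys():
        raise ValueError("both conditions must cover the same task IDs")
    if not expert:
        raise ValueError("no tasks to compare")
    diffs = [int(expert[task]) - int(baseline[task]) for task in expert]
    return sum(diffs) / len(diffs)


# Illustrative outcomes only (hypothetical task IDs, not data from the run).
expert_ok = {"task_a": True, "task_b": False, "task_c": False}
baseline_ok = {"task_a": False, "task_b": True, "task_c": False}

delta = paired_delta(expert_ok, baseline_ok)
```

Under this definition, equal success totals on the same split always give a paired delta of zero, whether or not the two conditions solved the same tasks. Shaped reward never enters the calculation, which keeps the metric tied to verifier success.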
The honest conclusion is:
Why this is useful
This PR gives reviewers and future contributors a concrete, reproducible harness for experimenting with AWM expert curricula:
Attribution
This builds on the expert-in-the-loop track from #428. Original work is credited to @sfc-gh-mhidayetoglu and @sfc-gh-kganesan through commit co-author trailers and PR attribution.
Validation
Passing checks:
python3 -m py_compile examples/awm_expert_in_the_loop/*.py
uv run usort check examples/awm_expert_in_the_loop/
uv run ruff check examples/awm_expert_in_the_loop/
uv run ruff format --check examples/awm_expert_in_the_loop/
PYTHONPATH=src:envs:examples/awm_expert_in_the_loop uv run pytest examples/awm_expert_in_the_loop/test_rollout.py -q
PYTHONPATH=src:envs uv run pytest tests/ -q --tb=short
Observed locally:
Known baseline issue: the repository lint hook reports pre-existing formatting differences in unrelated envs/* files. Targeted lint and format checks for this example pass, and PR CI lint passes.