feat(examples): ECHO on the Agent World Model env (real-env role masks) #17
Closed
thegovind wants to merge 3 commits into
Applies the ECHO objective (RFC 010 / #16) to the upstream agent_world_model_env (AgentWorldModel-1K). An AWM observation already separates real env output (tool_result/error/verify_result) from harness boilerplate (warning), which is exactly ECHO's per-token action/observation/warning distinction, so an AWM rollout is a ready-made ECHO trajectory.
Adds a self-contained, CPU-only example:
- echo.py: role-masked trajectory + echo_loss (distilled from RFC 010)
- awm.py: awm_episode_to_trajectory + live_capture (real AWM server)
- run_demo.py: role accounting + loss three ways (GRPO / ECHO / verifier-free)
- fixture + 8 passing tests
On a real AWM e-commerce episode ~71% of learnable tokens are environment observations (~2.4x the action tokens) that standard agent-RL discards and ECHO recovers at ~zero extra compute; the ratio holds with a real BPE tokenizer (--hf).
Relates to #14, #16, #12; substrate is AWM (upstream huggingface#428).
Signed-off-by: Govind Kamtamneni <gok@microsoft.com>
No behavior bugs found; tightens correctness + honest framing:
- verifier-free mode now shows literal pure env-token CE (coeff 1.0), not
the λ-scaled term; demo/README/tests updated accordingly (+λ-scaling test).
- relabel the action term 'GRPO-style / REINFORCE-style' and call out that
advantage=reward is a 1-sequence stand-in for group-relative advantage
(no ratio/clip/KL/critic) — avoids overclaiming GRPO.
- live_capture now takes the real task/scenario/tool list from reset()/
list_tools() and releases the session via done() in a finally.
- fold reward_type into error env-output payloads; scope the adapter to the
CallToolAction+verify subset in docs.
- scope offline numbers as one fixture ('this episode'); reconcile compute
claim (no extra env interaction/rollout inference, small extra backward);
add BPE segment-boundary tokenization caveat; fix --live PYTHONPATH.
- drop dead n_ctx var + unused role-constant imports in run_demo.
9 tests pass; offline demo deterministic.
Signed-off-by: Govind Kamtamneni <gok@microsoft.com>
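The three loss variants described in this commit can be sketched roughly as follows. This is an illustrative reading of the commit message, not the code in echo.py: the `Token` type, the `mode` names, the per-role mean normalization, and the default `lam` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Role labels (names assumed for illustration).
ACTION, ENV_OUTPUT, WARNING = "action", "env_output", "warning"


@dataclass
class Token:
    role: str
    logprob: float  # log p(token | prefix) under the current policy


def echo_loss(tokens: List[Token], reward: float, lam: float = 0.1,
              mode: str = "echo") -> float:
    """Hypothetical sketch of the three loss views.

    - "pg":            REINFORCE-style term on ACTION tokens only, with
                       advantage = reward (a 1-sequence stand-in, no
                       ratio/clip/KL/critic).
    - "echo":          that term plus lam * cross-entropy on ENV_OUTPUT tokens.
    - "verifier_free": pure ENV_OUTPUT cross-entropy with coefficient 1.0.
    WARNING tokens never contribute.
    """
    action_lp = [t.logprob for t in tokens if t.role == ACTION]
    env_lp = [t.logprob for t in tokens if t.role == ENV_OUTPUT]

    # Mean over tokens of each role (normalization choice is an assumption).
    pg = -reward * sum(action_lp) / max(len(action_lp), 1)
    env_ce = -sum(env_lp) / max(len(env_lp), 1)

    if mode == "pg":
        return pg
    if mode == "echo":
        return pg + lam * env_ce
    if mode == "verifier_free":
        return env_ce
    raise ValueError(f"unknown mode: {mode}")
```

With `reward=0.0` the `"pg"` value is zero regardless of the action log-probs, while `"echo"` and `"verifier_free"` still depend on the environment-output tokens.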
…rden for upstream
Verified end-to-end against a live agent_world_model_env server:
- fixtures/awm_ecommerce_episode.json is now a REAL captured e_commerce_33 episode (real tool names search_products/list_product_offers/add_item_to_cart/verify, real observations), not hand-authored. Added capture_episode.py that produced it (correct top-rated-under-$200 solution).
- run_demo.py --live replays it against a running server and reproduces the accounting from genuine observations (confirmed live).
- On real data the signal is stronger: 89% of learnable tokens are env observations (~7.9x the action tokens). The scenario's deterministic verifier returns no success signal (reward 0) -> policy-gradient term is exactly 0 while ECHO still learns from observations: a real, honest demonstration of ECHO's sparse/ambiguous-reward motivation. Demo reframed accordingly.
- Lint: ruff check / ruff format --check / usort check all pass on the example (fixed F541 + F841, applied repo formatting). 10 tests pass.
- Docs: real task/tools, captured-fixture provenance, live setup (uv + sqlalchemy + fastapi-mcp), hosted-Space option.
Tests cover warning-role separation via a synthetic episode (real episode has no warnings).
Signed-off-by: Govind Kamtamneni <gok@microsoft.com>
Folded into #16, which is now the single PR (RFC 010 plus the toy terminal reference that trains plus this real-env AWM example, sharing the same role-mask schema). Closing as superseded.
TL;DR
ECHO on a real upstream environment: envs/agent_world_model_env, the AgentWorldModel suite of 1,000 MCP tool-use worlds. Where #16 introduces the primitive on a toy terminal, this shows the role masks fall out of a real reset()/step() env you can run today.
The key point
An AWMObservation already separates the roles ECHO needs:
- tool_result, error: ENV_OUTPUT (the free world-model target)
- verify_result: ENV_OUTPUT (real grader output)
- warning: WARNING (harness boilerplate, excluded by default)
- ACTION (policy-gradient target)
So an AWM rollout is a ready-made ECHO trajectory.
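A minimal sketch of that field-to-role mapping follows. It assumes the observation is available as a plain dict keyed by the field names above and that the agent's own tool-call text is what receives the ACTION role; `FIELD_ROLES` and `step_to_segments` are hypothetical names, not the adapter in awm.py.

```python
from typing import Dict, List, Optional, Tuple

# Role labels (names assumed for illustration).
ACTION, ENV_OUTPUT, WARNING = "action", "env_output", "warning"

# Observation field -> ECHO role, following the mapping listed above.
FIELD_ROLES: Dict[str, str] = {
    "tool_result": ENV_OUTPUT,
    "error": ENV_OUTPUT,
    "verify_result": ENV_OUTPUT,
    "warning": WARNING,
}


def step_to_segments(action_text: str,
                     observation: Dict[str, Optional[str]]) -> List[Tuple[str, str]]:
    """Turn one (action, observation) step into role-tagged text segments.

    The action text is tagged ACTION (an assumption about where that role
    comes from); each non-empty observation field is tagged by FIELD_ROLES.
    """
    segments = [(ACTION, action_text)]
    for field, role in FIELD_ROLES.items():
        value = observation.get(field)
        if value:  # skip missing or empty fields
            segments.append((role, str(value)))
    return segments
```

Keeping the mapping in one dict makes the default exclusion of warning text a masking decision made later, at loss time, rather than something baked into the adapter.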
Result (CPU, no downloads)
On a real captured e-commerce episode (search, offers, cart, add, verify):
89% of the learnable tokens are environment observations that standard agent-RL masks out. The ratio holds with a real BPE tokenizer (92%). This episode's verifier returns reward 0, so the policy-gradient term is exactly zero, yet ECHO still learns from 4,659 observation tokens. Sparse reward is exactly its motivating case.
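The kind of role accounting behind these percentages can be sketched as below. The whitespace tokenizer, the `role_accounting` name, and the toy segments in the usage note are assumptions for illustration; they do not reproduce the numbers reported here, and "learnable" is taken to mean ACTION plus ENV_OUTPUT tokens with WARNING excluded.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Role labels (names assumed for illustration).
ACTION, ENV_OUTPUT, WARNING = "action", "env_output", "warning"


def role_accounting(segments: Iterable[Tuple[str, str]],
                    tokenize: Callable[[str], List[str]] = str.split) -> Dict[str, float]:
    """Count tokens per role and the env-observation share of learnable tokens."""
    counts: Counter = Counter()
    for role, text in segments:
        counts[role] += len(tokenize(text))

    # Learnable = action + env-output tokens; warnings are excluded by default.
    learnable = counts[ACTION] + counts[ENV_OUTPUT]
    env_share = counts[ENV_OUTPUT] / learnable if learnable else 0.0
    return {
        "action": counts[ACTION],
        "env_output": counts[ENV_OUTPUT],
        "warning": counts[WARNING],
        "env_share": env_share,
    }
```

For example, two action tokens and six environment-output tokens give an `env_share` of 0.75. Passing a real tokenizer's encode function as `tokenize` would give subword counts instead of whitespace counts.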
Run
Upstream plan
This folds into #16 as a single upstream PR (RFC 010, the terminal reference, and this real-env demo, sharing one echo.py). Kept as a separate draft here only to record a focused walkthrough.
Token accounting plus the loss, not a trained benchmark. DCO signed. Closes part of #14.