feat(examples): ECHO on the Agent World Model env (real-env role masks) #17

Closed
thegovind wants to merge 3 commits into main from thegovind/feat/echo-on-agent-world-model

thegovind commented Jun 16, 2026

Companion to #16 (RFC 010). Proposal: #14.

Walkthrough video: add link before upstream submission

TL;DR

This PR runs ECHO on a real upstream environment: envs/agent_world_model_env, the AgentWorldModel suite of 1,000 MCP tool-use worlds. Where #16 introduces the primitive on a toy terminal, this example shows that the role masks fall out of a real reset() / step() env you can run today.

The key point

An AWMObservation already separates the roles ECHO needs:

| AWMObservation field | ECHO role |
| --- | --- |
| tool_result, error | ENV_OUTPUT (the free world-model target) |
| verify_result | ENV_OUTPUT (real grader output) |
| warning | WARNING (harness boilerplate, excluded by default) |
| agent tool call | ACTION (policy-gradient target) |

So an AWM rollout is a ready-made ECHO trajectory.
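The mapping above can be sketched in a few lines. This is only an illustration under stated assumptions: `Obs` is a hypothetical stand-in for AWMObservation carrying just the four fields named in the table, and `step_to_segments` is an invented helper, not the `awm_episode_to_trajectory` function from this PR.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Role labels taken from the table above.
ACTION, ENV_OUTPUT, WARNING = "action", "env_output", "warning"


@dataclass
class Obs:
    """Hypothetical stand-in for AWMObservation (only the fields in the table)."""
    tool_result: Optional[str] = None
    error: Optional[str] = None
    verify_result: Optional[str] = None
    warning: Optional[str] = None


def step_to_segments(tool_call: str, obs: Obs) -> List[Tuple[str, str]]:
    """Turn one (agent tool call, observation) pair into role-tagged text segments."""
    segments = [(ACTION, tool_call)]
    # Real environment output: tool results, errors, and grader output.
    for field_name in ("tool_result", "error", "verify_result"):
        text = getattr(obs, field_name)
        if text:
            segments.append((ENV_OUTPUT, text))
    # Harness boilerplate gets its own role so it can be masked out by default.
    if obs.warning:
        segments.append((WARNING, obs.warning))
    return segments
```

Keeping the warning text under a separate role, rather than dropping it at capture time, leaves the decision to exclude it to the loss mask.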

Result (CPU, no downloads)

On a real captured e-commerce episode (search, offers, cart, add, verify):

context 1304 · action 588 · env_output 4659 · warning 0
free signal: 4659 / 5247 learnable tokens = 89%   (7.9x the action tokens)

89% of the learnable tokens are environment observations that standard agent-RL masks out. With a real BPE tokenizer the share is 92%, so the ratio holds. This episode's verifier returns reward 0, so the policy-gradient term is exactly zero, yet ECHO still learns from 4,659 observation tokens. Sparse reward is its motivating case.
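The arithmetic behind those two numbers can be checked directly from the reported counts. The sketch below assumes, as the accounting line implies, that "learnable" means action plus env_output tokens, with context and warning tokens excluded; `free_signal` is a hypothetical helper name, not part of the example's code.

```python
def free_signal(counts: dict) -> tuple:
    """Return (learnable tokens, env share of learnable, env-to-action ratio)."""
    learnable = counts["action"] + counts["env_output"]  # context and warning excluded
    share = counts["env_output"] / learnable
    ratio = counts["env_output"] / counts["action"]
    return learnable, share, ratio


# Counts reported for the captured e-commerce episode.
counts = {"context": 1304, "action": 588, "env_output": 4659, "warning": 0}
learnable, share, ratio = free_signal(counts)
print(f"free signal: {counts['env_output']} / {learnable} = {share:.0%} ({ratio:.1f}x the action tokens)")
# prints: free signal: 4659 / 5247 = 89% (7.9x the action tokens)
```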

Run

cd examples/echo_on_agent_world_model
python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt
python run_demo.py                                 # offline, deterministic
pytest -q                                          # 10 passed
python run_demo.py --live http://localhost:8899    # against a real AWM server

Upstream plan

This folds into #16 as a single upstream PR (RFC 010, the terminal reference, and this real-env demo, sharing one echo.py). Kept as a separate draft here only to record a focused walkthrough.

This example covers token accounting plus the loss; it is not a trained benchmark. DCO signed. Closes part of #14.
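To make "the loss" concrete, here is a minimal pure-Python sketch of the three variants the demo reports. It is not the `echo_loss` in echo.py. It assumes per-token log-probabilities are already available, uses advantage = reward as a single-sequence stand-in (no ratio, clip, KL, or critic), and normalizes each term by its token count; the normalization, the function name, and the default `lam` are all assumptions made for illustration.

```python
def loss_three_ways(logps, roles, reward, lam=0.1):
    """Compute action-only, ECHO, and verifier-free losses for one trajectory.

    logps:  per-token log-probabilities under the policy
    roles:  per-token role labels ("context", "action", "env_output", "warning")
    reward: scalar verifier reward, used directly as the advantage (assumption)
    lam:    weight on the env-output cross-entropy term (illustrative default)
    """
    action_lp = [lp for lp, r in zip(logps, roles) if r == "action"]
    env_lp = [lp for lp, r in zip(logps, roles) if r == "env_output"]

    # REINFORCE-style term on action tokens only.
    pg = -reward * sum(action_lp) / max(len(action_lp), 1)
    # Cross-entropy on environment-output tokens (the "free" signal).
    env_ce = -sum(env_lp) / max(len(env_lp), 1)

    return {
        "action_only": pg,             # what standard agent-RL optimizes
        "echo": pg + lam * env_ce,     # action term plus weighted env-token CE
        "verifier_free": env_ce,       # pure env-token CE, coefficient 1.0
    }


# Toy trajectory with reward 0: the action term vanishes, the env term does not.
out = loss_three_ways(
    logps=[-0.5, -1.0, -2.0, -3.0],
    roles=["context", "action", "env_output", "env_output"],
    reward=0.0,
)
```

With reward 0 the action-only loss carries no gradient, while the ECHO and verifier-free variants still depend on the env-output tokens, which is the sparse-reward case described above.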

Applies the ECHO objective (RFC 010 / #16) to the upstream
agent_world_model_env (AgentWorldModel-1K). An AWM observation already
separates real env output (tool_result/error/verify_result) from harness
boilerplate (warning) — exactly ECHO's per-token action/observation/warning
distinction — so an AWM rollout is a ready-made ECHO trajectory.

Adds a self-contained, CPU-only example:
- echo.py   : role-masked trajectory + echo_loss (distilled from RFC 010)
- awm.py     : awm_episode_to_trajectory + live_capture (real AWM server)
- run_demo.py: role accounting + loss three ways (GRPO / ECHO / verifier-free)
- fixture + 8 passing tests

On a real AWM e-commerce episode ~71% of learnable tokens are environment
observations (~2.4x the action tokens) that standard agent-RL discards and
ECHO recovers at ~zero extra compute; the ratio holds with a real BPE
tokenizer (--hf). Relates to #14, #16, #12; substrate is AWM (upstream huggingface#428).

Signed-off-by: Govind Kamtamneni <gok@microsoft.com>
No behavior bugs found; tightens correctness + honest framing:

- verifier-free mode now shows literal pure env-token CE (coeff 1.0), not
  the λ-scaled term; demo/README/tests updated accordingly (+λ-scaling test).
- relabel the action term 'GRPO-style / REINFORCE-style' and call out that
  advantage=reward is a 1-sequence stand-in for group-relative advantage
  (no ratio/clip/KL/critic) — avoids overclaiming GRPO.
- live_capture now takes the real task/scenario/tool list from reset()/
  list_tools() and releases the session via done() in a finally.
- fold reward_type into error env-output payloads; scope the adapter to the
  CallToolAction+verify subset in docs.
- scope offline numbers as one fixture ('this episode'); reconcile compute
  claim (no extra env interaction/rollout inference, small extra backward);
  add BPE segment-boundary tokenization caveat; fix --live PYTHONPATH.
- drop dead n_ctx var + unused role-constant imports in run_demo.
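The session-release change in the list above follows a common try/finally pattern. The sketch below is illustrative only: it assumes a client object exposing reset(), list_tools(), step(), and done() as named in the commit message, and the return shapes plus the `FakeEnv` class are hypothetical, not the real AWM client API.

```python
def live_capture(env, tool_calls):
    """Record one episode and always release the session, even on error."""
    episode = {"task": None, "tools": [], "steps": []}
    try:
        first = env.reset()
        # Take the task and tool list from the server rather than hard-coding them.
        episode["task"] = first.get("task")  # dict return shape is an assumption
        episode["tools"] = list(env.list_tools())
        for call in tool_calls:
            episode["steps"].append((call, env.step(call)))
    finally:
        env.done()  # runs on success and on exception
    return episode


class FakeEnv:
    """Hypothetical in-memory env used only to exercise the sketch."""

    def __init__(self):
        self.released = False

    def reset(self):
        return {"task": "demo task"}

    def list_tools(self):
        return ["search_products", "verify"]

    def step(self, call):
        if call == "boom":
            raise RuntimeError("tool failed")
        return {"tool_result": f"ok:{call}"}

    def done(self):
        self.released = True
```

Putting done() in the finally clause means a failing tool call still propagates its exception, but the server-side session is not leaked.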

9 tests pass; offline demo deterministic.

Signed-off-by: Govind Kamtamneni <gok@microsoft.com>
…rden for upstream

Verified end-to-end against a live agent_world_model_env server:

- fixtures/awm_ecommerce_episode.json is now a REAL captured e_commerce_33
  episode (real tool names search_products/list_product_offers/add_item_to_cart/
  verify, real observations), not hand-authored. Added capture_episode.py that
  produced it (correct top-rated-under-$200 solution).
- run_demo.py --live replays it against a running server and reproduces the
  accounting from genuine observations (confirmed live).
- On real data the signal is stronger: 89% of learnable tokens are env
  observations (~7.9x the action tokens). The scenario's deterministic verifier
  returns no success signal (reward 0) -> policy-gradient term is exactly 0 while
  ECHO still learns from observations: a real, honest demonstration of ECHO's
  sparse/ambiguous-reward motivation. Demo reframed accordingly.
- Lint: ruff check / ruff format --check / usort check all pass on the example
  (fixed F541 + F841, applied repo formatting). 10 tests pass.
- Docs: real task/tools, captured-fixture provenance, live setup (uv + sqlalchemy
  + fastapi-mcp), hosted-Space option. Tests cover warning-role separation via a
  synthetic episode (real episode has no warnings).

Signed-off-by: Govind Kamtamneni <gok@microsoft.com>
@thegovind

Folded into #16, which is now the single PR: RFC 010, the toy terminal reference that trains, and this real-env AWM example, all sharing the same role-mask schema. Closing as superseded.

@thegovind thegovind closed this Jun 17, 2026