feat(examples): ECHO on the Agent World Model env (real-env role masks) by thegovind · Pull Request #17 · thegovind/OpenEnv

thegovind · 2026-06-16T03:03:56Z

Companion to #16 (RFC 010). Proposal: #14.

Walkthrough video: add link before upstream submission

TL;DR

ECHO on a real upstream environment: envs/agent_world_model_env, the AgentWorldModel suite of 1,000 MCP tool-use worlds. Where #16 introduces the primitive on a toy terminal, this shows the role masks fall out of a real reset() / step() env you can run today.

The key point

An AWMObservation already separates the roles ECHO needs:

| AWMObservation field | ECHO role |
| --- | --- |
| `tool_result`, `error` | `ENV_OUTPUT` (the free world-model target) |
| `verify_result` | `ENV_OUTPUT` (real grader output) |
| `warning` | `WARNING` (harness boilerplate, excluded by default) |
| agent tool call | `ACTION` (policy-gradient target) |

So an AWM rollout is a ready-made ECHO trajectory.
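The mapping in the table above can be sketched in a few lines. This is a minimal illustration, not the PR's `awm.py`: the `Role`, `Segment`, `observation_to_segments`, and `learnable` names are hypothetical, and the observation is treated as a plain dict keyed by the field names from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    CONTEXT = "context"
    ACTION = "action"
    ENV_OUTPUT = "env_output"
    WARNING = "warning"


@dataclass(frozen=True)
class Segment:
    role: Role
    text: str


# Field-to-role mapping copied from the table; everything else is illustrative.
FIELD_ROLES = {
    "tool_result": Role.ENV_OUTPUT,
    "error": Role.ENV_OUTPUT,
    "verify_result": Role.ENV_OUTPUT,
    "warning": Role.WARNING,
}


def observation_to_segments(obs: dict) -> list[Segment]:
    """Turn one observation (assumed to be a plain dict) into role-tagged segments."""
    segments = []
    for field, role in FIELD_ROLES.items():
        value = obs.get(field)
        if value:  # skip absent or empty fields
            segments.append(Segment(role, str(value)))
    return segments


def learnable(segments: list[Segment], include_warnings: bool = False) -> list[Segment]:
    """Keep the segments that carry a training signal; warnings are excluded by default."""
    keep = {Role.ACTION, Role.ENV_OUTPUT}
    if include_warnings:
        keep.add(Role.WARNING)
    return [s for s in segments if s.role in keep]
```

An agent tool call would be appended as a `Segment(Role.ACTION, ...)` before the observation it produced, so a full episode alternates action and environment segments.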

Result (CPU, no downloads)

On a real captured e-commerce episode (search, offers, cart, add, verify):

context 1304 · action 588 · env_output 4659 · warning 0
free signal: 4659 / 5247 learnable tokens = 89%   (7.9x the action tokens)

89% of the learnable tokens are environment observations that standard agent-RL masks out. The ratio holds with a real BPE tokenizer (92%). This episode's verifier returns reward 0, so the policy-gradient term is exactly zero, yet ECHO still learns from 4,659 observation tokens. Sparse reward is exactly its motivating case.
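The arithmetic behind the free-signal line can be checked directly from the four role counts reported above. The sketch below assumes, as the output implies, that learnable tokens are action plus env_output, with context and warning excluded; the `free_signal` helper is hypothetical and is not part of `run_demo.py`.

```python
# Role counts as reported for the captured e-commerce episode.
counts = {"context": 1304, "action": 588, "env_output": 4659, "warning": 0}


def free_signal(counts: dict) -> tuple[int, float, float]:
    """Return (learnable tokens, env_output share, env_output / action ratio)."""
    learnable = counts["action"] + counts["env_output"]
    share = counts["env_output"] / learnable
    ratio = counts["env_output"] / counts["action"]
    return learnable, share, ratio


learnable, share, ratio = free_signal(counts)
print(
    f"free signal: {counts['env_output']} / {learnable} learnable tokens = "
    f"{share:.0%}   ({ratio:.1f}x the action tokens)"
)
# prints: free signal: 4659 / 5247 learnable tokens = 89%   (7.9x the action tokens)
```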

Run

cd examples/echo_on_agent_world_model
python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt
python run_demo.py                                 # offline, deterministic
pytest -q                                          # 10 passed
python run_demo.py --live http://localhost:8899    # against a real AWM server

Upstream plan

This folds into #16 as a single upstream PR (RFC 010, the terminal reference, and this real-env demo, sharing one echo.py). Kept as a separate draft here only to record a focused walkthrough.

Token accounting plus the loss, not a trained benchmark. DCO signed. Closes part of #14.

Applies the ECHO objective (RFC 010 / #16) to the upstream agent_world_model_env (AgentWorldModel-1K). An AWM observation already separates real env output (tool_result/error/verify_result) from harness boilerplate (warning), which is exactly ECHO's per-token action/observation/warning distinction, so an AWM rollout is a ready-made ECHO trajectory.

Adds a self-contained, CPU-only example:
- echo.py: role-masked trajectory + echo_loss (distilled from RFC 010)
- awm.py: awm_episode_to_trajectory + live_capture (real AWM server)
- run_demo.py: role accounting + loss three ways (GRPO / ECHO / verifier-free)
- fixture + 8 passing tests

On a real AWM e-commerce episode ~71% of learnable tokens are environment observations (~2.4x the action tokens) that standard agent-RL discards and ECHO recovers at ~zero extra compute; the ratio holds with a real BPE tokenizer (--hf).

Relates to #14, #16, #12; substrate is AWM (upstream huggingface#428).

Signed-off-by: Govind Kamtamneni <gok@microsoft.com>

No behavior bugs found; tightens correctness + honest framing:
- verifier-free mode now shows literal pure env-token CE (coeff 1.0), not the λ-scaled term; demo/README/tests updated accordingly (+λ-scaling test).
- relabel the action term 'GRPO-style / REINFORCE-style' and call out that advantage=reward is a 1-sequence stand-in for group-relative advantage (no ratio/clip/KL/critic), to avoid overclaiming GRPO.
- live_capture now takes the real task/scenario/tool list from reset()/list_tools() and releases the session via done() in a finally.
- fold reward_type into error env-output payloads; scope the adapter to the CallToolAction+verify subset in docs.
- scope offline numbers as one fixture ('this episode'); reconcile compute claim (no extra env interaction/rollout inference, small extra backward); add BPE segment-boundary tokenization caveat; fix --live PYTHONPATH.
- drop dead n_ctx var + unused role-constant imports in run_demo.

9 tests pass; offline demo deterministic.

Signed-off-by: Govind Kamtamneni <gok@microsoft.com>
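The three loss modes described in this commit (REINFORCE-style action term only, action term plus λ-scaled env-token CE, and verifier-free pure env-token CE) can be sketched without a tensor library. This is an illustration under stated assumptions, not the PR's `echo_loss`: the function name, the per-role mean normalization, and the default `lam=0.1` are all hypothetical, and per-token log-probabilities are passed in as plain floats.

```python
def _mean(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0


def echo_loss_sketch(
    logprobs: list[float],
    roles: list[str],
    reward: float,
    lam: float = 0.1,  # assumed default, not taken from the PR
    verifier_free: bool = False,
) -> float:
    """logprobs[i] is the model's log-probability of token i; roles[i] is its role."""
    action_lp = [lp for lp, r in zip(logprobs, roles) if r == "action"]
    env_lp = [lp for lp, r in zip(logprobs, roles) if r == "env_output"]

    # Cross-entropy on observation tokens; context and warning tokens are masked out.
    env_ce = -_mean(env_lp)
    if verifier_free:
        return env_ce  # pure env-token CE, coefficient 1.0

    # advantage = reward: a 1-sequence stand-in for group-relative advantage
    # (no ratio, clip, KL, or critic).
    advantage = reward
    pg = -advantage * _mean(action_lp)
    return pg + lam * env_ce
```

Setting `lam=0.0` gives the action-only term. With `reward=0` the action term vanishes and only the λ-scaled observation term remains, which is the sparse-reward case the PR description highlights.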

…rden for upstream

Verified end-to-end against a live agent_world_model_env server:
- fixtures/awm_ecommerce_episode.json is now a REAL captured e_commerce_33 episode (real tool names search_products/list_product_offers/add_item_to_cart/verify, real observations), not hand-authored. Added capture_episode.py that produced it (correct top-rated-under-$200 solution).
- run_demo.py --live replays it against a running server and reproduces the accounting from genuine observations (confirmed live).
- On real data the signal is stronger: 89% of learnable tokens are env observations (~7.9x the action tokens). The scenario's deterministic verifier returns no success signal (reward 0), so the policy-gradient term is exactly 0 while ECHO still learns from observations: a real, honest demonstration of ECHO's sparse/ambiguous-reward motivation. Demo reframed accordingly.
- Lint: ruff check / ruff format --check / usort check all pass on the example (fixed F541 + F841, applied repo formatting). 10 tests pass.
- Docs: real task/tools, captured-fixture provenance, live setup (uv + sqlalchemy + fastapi-mcp), hosted-Space option.

Tests cover warning-role separation via a synthetic episode (real episode has no warnings).

Signed-off-by: Govind Kamtamneni <gok@microsoft.com>

thegovind · 2026-06-17T16:18:17Z

Folded into #16, which is now the single PR (RFC 010 plus the toy terminal reference that trains plus this real-env AWM example, sharing the same role-mask schema). Closing as superseded.

thegovind mentioned this pull request Jun 16, 2026

[Proposal] RFC — Env-token world modeling (ECHO): per-token trajectory role masks + optimizer world-loss config #14

Open

thegovind added 2 commits June 15, 2026 20:15

thegovind mentioned this pull request Jun 17, 2026

feat: ECHO env-token world modeling (RFC 010 + runnable demo) #16

Closed

thegovind closed this Jun 17, 2026

thegovind wants to merge 3 commits into main from thegovind/feat/echo-on-agent-world-model
