plan_execute: new two-phase setting (plan then execute)#70

Open
akhatua2 wants to merge 1 commit into feat/openhands-docker-backend from feat/plan-execute


Summary

Adds --setting plan_execute: a two-phase variant of coop where agents first plan (coordinating via messaging + git, writing free-form plan.txt files) and then a fresh set of executor agents runs against each agent's own plan verbatim — no feature spec, no teammate plan, no Phase 1 conversation log.

The Phase 1 prompt explicitly frames the goal as "produce a plan such that your patches don't merge-conflict", making coordination the load-bearing skill being tested. Phase 2 keeps the full coop toolset (messaging + git stay on) so we measure whether good plans are enough or whether mid-execute coordination still matters.

Stacked on #69 (the docker backend + patch.txt flow this builds on).

What's new

  • runner/plan_execute.py: new orchestrator. Shared infra setup → Phase 1 with submission_artifact = plan.txt → Phase 2 with each agent's plan as task → eval against Phase 2 patches.
  • runner/coop.py: _run_pair_phase() and setup_pair_infra() extracted. execute_coop() composes them once; execute_plan_execute() composes them twice. _spawn_agent() gained task_override / extra_config / log_dir_override / setting_subdir kwargs.
  • agents/openhands_agent_sdk/adapter.py: _plan_submission_instructions(is_coop) helper, plus config-based submission_template / submission_artifact overrides so the same adapter handles both phases.
  • cli.py, runner/core.py: plan_execute setting wired into CLI + dispatch.
  • eval/runs.py, eval/evaluate.py: discover plan_execute/ log dirs, treat the setting like coop for eval (two patches, merge, test).

Log layout

logs/<run>/plan_execute/<repo>/<task>/<pair>/
  phase1/
    agent1.plan, agent2.plan, agent*_traj.json, conversation.json, result.json
  agent1.patch, agent2.patch       # phase 2 at canonical path → auto-eval works as-is
  agent*_traj.json, conversation.json
  result.json                       # rolled-up cost/steps + phase 2 statuses
  eval.json

Test plan

  • CI green (ruff / format / mypy / pytest — 385 pass, 63 skip)
  • Smoke on llama_index_task/18813 [1,2]: phase 1 produced a coordinated plan with explicit scope split ("Teammate (agent2) will handle AudioBlock.resolve_audio…"), 8 inter-agent messages. Phase 2 trajectory verified to contain ONLY the plan (feature.md title strings absent). Both patches applied, merge clean, eval ran end-to-end.
  • Full 50-pair flash subset run is in flight; will post a per-pair comparison against the baseline from #69 (openhands_sdk: local docker backend + patch.txt submission; 50/50 Submitted, 9 both_passed = 18%) when it completes.

Scope limit (v1)

Only openhands_sdk is supported. CLI rejects other adapters with a clear NotImplementedError. mini_swe_agent_v2 etc. would need to honour the same config["submission_template"] and config["submission_artifact"] overrides.

🤖 Generated with Claude Code

Adds `--setting plan_execute`: agents first plan (writing `plan.txt`),
then a fresh set of executor agents runs against each agent's own plan
verbatim (no feature spec, no teammate plan, no Phase 1 conversation
log). Tests whether explicit, coordinated planning improves merge
cleanliness and pass rate over the single-phase coop.

## Design

- Phase 1: both agents see the feature spec, full coop toolset
  (messaging + git). The prompt explicitly frames the goal as
  "produce a plan such that your patches don't merge-conflict" and
  instructs the agent to save its plan to `plan.txt`.
- Phase 2: two fresh agent containers. Each agent's task message is
  its own `plan.txt` content. Full coop tools again, but no feature
  spec leakage — verified empirically: phase 2 trajectory contains
  the plan verbatim, not the feature.md text. Eval runs against
  Phase 2 patches.
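
A minimal sketch of the kind of leak check described above, assuming the
Phase 2 trajectory is available as plain text. The function name
`trajectory_is_leak_free` and its arguments are hypothetical and are not
taken from this PR; the real verification may have been done differently.

```python
def trajectory_is_leak_free(
    trajectory_text: str, plan_text: str, spec_titles: list[str]
) -> bool:
    """Check that a Phase 2 trajectory carries the plan but no spec text.

    Hypothetical helper: `spec_titles` stands in for the feature.md title
    strings mentioned in the validation notes.
    """
    # The executor must have received its own plan verbatim.
    has_plan = plan_text in trajectory_text
    # None of the feature spec title strings may appear anywhere.
    leaked = [title for title in spec_titles if title in trajectory_text]
    return has_plan and not leaked


# Illustrative strings only; they are not real plan or spec content.
example_plan = "Step 1: edit module A. Step 2: leave module B to the teammate."
example_trajectory = "TASK:\n" + example_plan
print(trajectory_is_leak_free(example_trajectory, example_plan, ["Feature: X"]))
```

A substring check like this only catches verbatim leakage; a paraphrased
spec would pass it.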

## Code structure

- `runner/coop.py`: extracted `_run_pair_phase()` (threaded per-agent
  spawn + per-agent artifact write) and `setup_pair_infra()` (redis
  namespace + git server). `execute_coop()` now composes them once;
  `execute_plan_execute()` composes them twice. `_spawn_agent()`
  gained `task_override` / `extra_config` / `log_dir_override` /
  `setting_subdir` keyword args.
- `runner/plan_execute.py`: new orchestrator. Sets up shared infra,
  runs Phase 1 with `submission_template = plan-block` and
  `submission_artifact = "plan.txt"`, builds `plan_per_agent` dict,
  runs Phase 2 with `task_override_per_agent = plan_per_agent`,
  writes top-level + per-phase `result.json`.
- `agents/openhands_agent_sdk/adapter.py`: new
  `_plan_submission_instructions(is_coop)` helper, plus config-based
  overrides (`submission_template`, `submission_artifact`) so the
  same adapter handles both phases.
- `cli.py`: `plan_execute` added to `--setting` choices.
- `runner/core.py`: dispatch to `execute_plan_execute` for the new
  setting.
- `eval/runs.py`, `eval/evaluate.py`: discover `plan_execute/` log
  dirs, treat the setting like coop for eval (two patches, merge,
  test).
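
The two-phase composition can be sketched roughly as follows. This is not
the code in `runner/plan_execute.py`: `run_plan_execute`, the `run_phase`
callable, and its signature are hypothetical stand-ins for
`_run_pair_phase()`, infra setup is omitted, and treating `plan-block` as a
plain string and `patch.txt` as the default artifact are assumptions.

```python
from typing import Callable

# Hypothetical phase runner: (task per agent, config) -> artifact per agent.
PhaseRunner = Callable[[dict[str, str], dict[str, str]], dict[str, str]]


def run_plan_execute(
    spec_per_agent: dict[str, str], run_phase: PhaseRunner
) -> dict[str, dict[str, str]]:
    # Phase 1: agents see the feature spec and submit a plan, not a patch.
    phase1_config = {
        "submission_template": "plan-block",
        "submission_artifact": "plan.txt",
    }
    plan_per_agent = run_phase(spec_per_agent, phase1_config)

    # Phase 2: each fresh agent gets only its own plan as the task.
    patch_per_agent = run_phase(plan_per_agent, {})
    return {"phase1": plan_per_agent, "phase2": patch_per_agent}


def fake_phase(tasks: dict[str, str], config: dict[str, str]) -> dict[str, str]:
    """Stand-in runner that echoes which artifact it would have produced."""
    artifact = config.get("submission_artifact", "patch.txt")  # assumed default
    return {agent: f"{artifact} for: {task}" for agent, task in tasks.items()}


result = run_plan_execute({"agent1": "spec", "agent2": "spec"}, fake_phase)
print(result["phase2"]["agent1"])
```

Passing the runner in as a callable keeps the sketch testable without
containers; the actual orchestrator presumably calls the extracted helpers
directly.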

## Log layout

```
logs/<run>/plan_execute/<repo>/<task>/<pair>/
  phase1/
    agent1.plan, agent2.plan, agent*_traj.json, conversation.json, result.json
  agent1.patch, agent2.patch         # phase 2 outputs at canonical path
  agent*_traj.json, conversation.json
  result.json                        # rolled-up cost/steps + phase 2 statuses
  eval.json                          # auto-eval against phase 2 patches
```

Writing Phase 2 patches to the canonical pair location means the existing
auto-eval works unchanged.
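
For illustration, the layout above maps onto paths like these. The helper
names `pair_dir` and `expected_artifacts` are hypothetical, and the argument
values in the usage line are placeholders rather than real run, repo, task,
or pair names.

```python
from pathlib import Path


def pair_dir(log_root: str, run: str, repo: str, task: str, pair: str) -> Path:
    """Build logs/<run>/plan_execute/<repo>/<task>/<pair>/ (hypothetical helper)."""
    return Path(log_root) / run / "plan_execute" / repo / task / pair


def expected_artifacts(base: Path) -> dict[str, Path]:
    """Map a few artifacts from the layout to their expected locations."""
    return {
        # Phase 1 outputs live in their own subdirectory.
        "phase1_plan_agent1": base / "phase1" / "agent1.plan",
        "phase1_plan_agent2": base / "phase1" / "agent2.plan",
        "phase1_result": base / "phase1" / "result.json",
        # Phase 2 outputs sit at the canonical pair path used by auto-eval.
        "patch_agent1": base / "agent1.patch",
        "patch_agent2": base / "agent2.patch",
        "result": base / "result.json",
        "eval": base / "eval.json",
    }


base = pair_dir("logs", "run", "repo", "task", "pair")
print(expected_artifacts(base)["patch_agent1"].as_posix())
```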

## Validation

- CI clean (ruff / ruff format / mypy / pytest 385 pass).
- Smoke test on `llama_index_task/18813 [1,2]`: phase 1 produced a
  coordinated plan with explicit scope split between agents
  ("Teammate (agent2) will handle AudioBlock.resolve_audio"), 8
  inter-agent messages. Phase 2 trajectory verified to contain ONLY
  the plan (feature.md title strings absent). Both patches applied,
  merge clean (naive), eval ran end-to-end.

## Out of scope (v1)

- Only `openhands_sdk` is supported. CLI rejects other adapters with
  a clear `NotImplementedError`. mini_swe_agent_v2 etc. would need
  to honour the same `config["submission_template"]` and
  `config["submission_artifact"]` overrides.
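
As a rough sketch of what honouring these overrides could look like in
another adapter: only the two config keys and the `NotImplementedError`
behaviour come from this PR. The function names, the default template name
`patch-block`, and the default artifact `patch.txt` are assumptions made for
illustration.

```python
SUPPORTED_ADAPTERS = {"openhands_sdk"}  # the only adapter supported in v1

DEFAULT_TEMPLATE = "patch-block"  # hypothetical default template name
DEFAULT_ARTIFACT = "patch.txt"    # assumed from the patch.txt flow in #69


def check_adapter_supported(setting: str, adapter: str) -> None:
    """Reject unsupported adapters for plan_execute (hypothetical guard)."""
    if setting == "plan_execute" and adapter not in SUPPORTED_ADAPTERS:
        raise NotImplementedError(
            f"--setting plan_execute does not support adapter {adapter!r} yet"
        )


def resolve_submission(config: dict[str, str]) -> tuple[str, str]:
    """Return (template, artifact), letting config override the defaults."""
    template = config.get("submission_template", DEFAULT_TEMPLATE)
    artifact = config.get("submission_artifact", DEFAULT_ARTIFACT)
    return template, artifact


# Phase 1 overrides both values; Phase 2 falls back to the defaults.
print(resolve_submission({"submission_template": "plan-block",
                          "submission_artifact": "plan.txt"}))
print(resolve_submission({}))
```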

Stacked on top of #69 (openhands_sdk docker backend + patch.txt
submission), which introduced the patch.txt-style flow this builds on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>