plan_execute: new two-phase setting (plan then execute)#70
Open
akhatua2 wants to merge 1 commit into
Open
Conversation
Adds `--setting plan_execute`: agents first plan (writing `plan.txt`),
then a fresh set of executor agents runs against each agent's own plan
verbatim (no feature spec, no teammate plan, no Phase 1 conversation
log). Tests whether explicit, coordinated planning improves merge
cleanliness and pass rate over the single-phase coop.
## Design
- Phase 1: both agents see the feature spec, full coop toolset
(messaging + git). The prompt explicitly frames the goal as
"produce a plan such that your patches don't merge-conflict" and
instructs the agent to save its plan to `plan.txt`.
- Phase 2: two fresh agent containers. Each agent's task message is
its own `plan.txt` content. Full coop tools again, but no feature
spec leakage — verified empirically: phase 2 trajectory contains
the plan verbatim, not the feature.md text. Eval runs against
Phase 2 patches.
## Code structure
- `runner/coop.py`: extracted `_run_pair_phase()` (threaded per-agent
spawn + per-agent artifact write) and `setup_pair_infra()` (redis
namespace + git server). `execute_coop()` now composes them once;
`execute_plan_execute()` composes them twice. `_spawn_agent()`
gained `task_override` / `extra_config` / `log_dir_override` /
`setting_subdir` keyword args.
- `runner/plan_execute.py`: new orchestrator. Sets up shared infra,
runs Phase 1 with `submission_template = plan-block` and
`submission_artifact = "plan.txt"`, builds `plan_per_agent` dict,
runs Phase 2 with `task_override_per_agent = plan_per_agent`,
writes top-level + per-phase `result.json`.
- `agents/openhands_agent_sdk/adapter.py`: new
`_plan_submission_instructions(is_coop)` helper, plus config-based
overrides (`submission_template`, `submission_artifact`) so the
same adapter handles both phases.
- `cli.py`: `plan_execute` added to `--setting` choices.
- `runner/core.py`: dispatch to `execute_plan_execute` for the new
setting.
- `eval/runs.py`, `eval/evaluate.py`: discover `plan_execute/` log
dirs, treat the setting like coop for eval (two patches, merge,
test).
## Log layout
```
logs/<run>/plan_execute/<repo>/<task>/<pair>/
phase1/
agent1.plan, agent2.plan, agent*_traj.json, conversation.json, result.json
agent1.patch, agent2.patch # phase 2 outputs at canonical path
agent*_traj.json, conversation.json
result.json # rolled-up cost/steps + phase 2 statuses
eval.json # auto-eval against phase 2 patches
```
Phase 2 patches at the canonical pair location means existing auto-eval
works unchanged.
## Validation
- CI clean (ruff / ruff format / mypy / pytest 385 pass).
- Smoke test on `llama_index_task/18813 [1,2]`: phase 1 produced a
coordinated plan with explicit scope split between agents
("Teammate (agent2) will handle AudioBlock.resolve_audio"), 8
inter-agent messages. Phase 2 trajectory verified to contain ONLY
the plan (feature.md title strings absent). Both patches applied,
merge clean (naive), eval ran end-to-end.
## Out of scope (v1)
- Only `openhands_sdk` is supported. CLI rejects other adapters with
a clear `NotImplementedError`. mini_swe_agent_v2 etc. would need
to honour the same `config["submission_template"]` and
`config["submission_artifact"]` overrides.
Stacked on top of #69 (openhands_sdk docker backend + patch.txt
submission), which introduced the patch.txt-style flow this builds on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
--setting plan_execute: a two-phase variant of coop where agents first plan (coordinating via messaging + git, writing free-formplan.txtfiles) and then a fresh set of executor agents runs against each agent's own plan verbatim — no feature spec, no teammate plan, no Phase 1 conversation log.The Phase 1 prompt explicitly frames the goal as "produce a plan such that your patches don't merge-conflict", making coordination the load-bearing skill being tested. Phase 2 keeps the full coop toolset (messaging + git stay on) so we measure whether good plans are enough or whether mid-execute coordination still matters.
Stacked on #69 (the docker backend +
patch.txtflow this builds on).What's new
runner/plan_execute.py— new orchestrator. Shared infra setup → Phase 1 withsubmission_artifact = plan.txt→ Phase 2 with each agent's plan as task → eval against Phase 2 patches.runner/coop.py—_run_pair_phase()+setup_pair_infra()extracted.execute_coop()composes them once,execute_plan_execute()composes them twice._spawn_agent()gainedtask_override/extra_config/log_dir_override/setting_subdirkwargs.agents/openhands_agent_sdk/adapter.py—_plan_submission_instructions(is_coop)helper + config-basedsubmission_template/submission_artifactoverrides so the same adapter handles both phases.cli.py,runner/core.py—plan_executesetting wired into CLI + dispatch.eval/runs.py,eval/evaluate.py— discoverplan_execute/log dirs, treat the setting like coop for eval (two patches, merge, test).Log layout
Test plan
llama_index_task/18813 [1,2]: phase 1 produced a coordinated plan with explicit scope split ("Teammate (agent2) will handle AudioBlock.resolve_audio…"), 8 inter-agent messages. Phase 2 trajectory verified to contain ONLY the plan (feature.md title strings absent). Both patches applied, merge clean, eval ran end-to-end.flashsubset run is in flight; will post per-pair comparison against openhands_sdk: local docker backend + patch.txt submission #69's baseline (50/50 Submitted, 9 both_passed = 18%) when it completes.Scope limit (v1)
Only
openhands_sdkis supported. CLI rejects other adapters with a clearNotImplementedError. mini_swe_agent_v2 etc. would need to honour the sameconfig["submission_template"]andconfig["submission_artifact"]overrides.🤖 Generated with Claude Code