plan_execute: new two-phase setting (plan then execute) by akhatua2 · Pull Request #70 · cooperbench/CooperBench

akhatua2 · 2026-05-29T08:42:43Z

Summary

Adds --setting plan_execute: a two-phase variant of coop where agents first plan (coordinating via messaging + git, writing free-form plan.txt files) and then a fresh set of executor agents runs against each agent's own plan verbatim — no feature spec, no teammate plan, no Phase 1 conversation log.

The Phase 1 prompt explicitly frames the goal as "produce a plan such that your patches don't merge-conflict", making coordination the load-bearing skill being tested. Phase 2 keeps the full coop toolset (messaging + git stay on) so we measure whether good plans are enough or whether mid-execute coordination still matters.

Stacked on #69 (the docker backend + patch.txt flow this builds on).

What's new

runner/plan_execute.py — new orchestrator. Shared infra setup → Phase 1 with submission_artifact = plan.txt → Phase 2 with each agent's plan as task → eval against Phase 2 patches.
runner/coop.py — _run_pair_phase() + setup_pair_infra() extracted. execute_coop() composes them once, execute_plan_execute() composes them twice. _spawn_agent() gained task_override / extra_config / log_dir_override / setting_subdir kwargs.
agents/openhands_agent_sdk/adapter.py — _plan_submission_instructions(is_coop) helper + config-based submission_template / submission_artifact overrides so the same adapter handles both phases.
cli.py, runner/core.py — plan_execute setting wired into CLI + dispatch.
eval/runs.py, eval/evaluate.py — discover plan_execute/ log dirs, treat the setting like coop for eval (two patches, merge, test).

Log layout

logs/<run>/plan_execute/<repo>/<task>/<pair>/
  phase1/
    agent1.plan, agent2.plan, agent*_traj.json, conversation.json, result.json
  agent1.patch, agent2.patch       # phase 2 at canonical path → auto-eval works as-is
  agent*_traj.json, conversation.json
  result.json                       # rolled-up cost/steps + phase 2 statuses
  eval.json

Test plan

CI green (ruff / format / mypy / pytest — 385 pass, 63 skip)
Smoke on llama_index_task/18813 [1,2]: phase 1 produced a coordinated plan with explicit scope split ("Teammate (agent2) will handle AudioBlock.resolve_audio…"), 8 inter-agent messages. Phase 2 trajectory verified to contain ONLY the plan (feature.md title strings absent). Both patches applied, merge clean, eval ran end-to-end.
Full 50-pair flash subset run is in flight; will post per-pair comparison against openhands_sdk: local docker backend + patch.txt submission #69's baseline (50/50 Submitted, 9 both_passed = 18%) when it completes.

Scope limit (v1)

Only openhands_sdk is supported. CLI rejects other adapters with a clear NotImplementedError. mini_swe_agent_v2 etc. would need to honour the same config["submission_template"] and config["submission_artifact"] overrides.

🤖 Generated with Claude Code

Adds `--setting plan_execute`: agents first plan (writing `plan.txt`), then a fresh set of executor agents runs against each agent's own plan verbatim (no feature spec, no teammate plan, no Phase 1 conversation log). Tests whether explicit, coordinated planning improves merge cleanliness and pass rate over the single-phase coop. ## Design - Phase 1: both agents see the feature spec, full coop toolset (messaging + git). The prompt explicitly frames the goal as "produce a plan such that your patches don't merge-conflict" and instructs the agent to save its plan to `plan.txt`. - Phase 2: two fresh agent containers. Each agent's task message is its own `plan.txt` content. Full coop tools again, but no feature spec leakage — verified empirically: phase 2 trajectory contains the plan verbatim, not the feature.md text. Eval runs against Phase 2 patches. ## Code structure - `runner/coop.py`: extracted `_run_pair_phase()` (threaded per-agent spawn + per-agent artifact write) and `setup_pair_infra()` (redis namespace + git server). `execute_coop()` now composes them once; `execute_plan_execute()` composes them twice. `_spawn_agent()` gained `task_override` / `extra_config` / `log_dir_override` / `setting_subdir` keyword args. - `runner/plan_execute.py`: new orchestrator. Sets up shared infra, runs Phase 1 with `submission_template = plan-block` and `submission_artifact = "plan.txt"`, builds `plan_per_agent` dict, runs Phase 2 with `task_override_per_agent = plan_per_agent`, writes top-level + per-phase `result.json`. - `agents/openhands_agent_sdk/adapter.py`: new `_plan_submission_instructions(is_coop)` helper, plus config-based overrides (`submission_template`, `submission_artifact`) so the same adapter handles both phases. - `cli.py`: `plan_execute` added to `--setting` choices. - `runner/core.py`: dispatch to `execute_plan_execute` for the new setting. - `eval/runs.py`, `eval/evaluate.py`: discover `plan_execute/` log dirs, treat the setting like coop for eval (two patches, merge, test). ## Log layout ``` logs/<run>/plan_execute/<repo>/<task>/<pair>/ phase1/ agent1.plan, agent2.plan, agent*_traj.json, conversation.json, result.json agent1.patch, agent2.patch # phase 2 outputs at canonical path agent*_traj.json, conversation.json result.json # rolled-up cost/steps + phase 2 statuses eval.json # auto-eval against phase 2 patches ``` Phase 2 patches at the canonical pair location means existing auto-eval works unchanged. ## Validation - CI clean (ruff / ruff format / mypy / pytest 385 pass). - Smoke test on `llama_index_task/18813 [1,2]`: phase 1 produced a coordinated plan with explicit scope split between agents ("Teammate (agent2) will handle AudioBlock.resolve_audio"), 8 inter-agent messages. Phase 2 trajectory verified to contain ONLY the plan (feature.md title strings absent). Both patches applied, merge clean (naive), eval ran end-to-end. ## Out of scope (v1) - Only `openhands_sdk` is supported. CLI rejects other adapters with a clear `NotImplementedError`. mini_swe_agent_v2 etc. would need to honour the same `config["submission_template"]` and `config["submission_artifact"]` overrides. Stacked on top of #69 (openhands_sdk docker backend + patch.txt submission), which introduced the patch.txt-style flow this builds on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

plan_execute: new two-phase setting (plan then execute)#70

plan_execute: new two-phase setting (plan then execute)#70
akhatua2 wants to merge 1 commit into
feat/openhands-docker-backendfrom
feat/plan-execute

akhatua2 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

akhatua2 commented May 29, 2026

Summary

What's new

Log layout

Test plan

Scope limit (v1)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant