team mode: structural forcing for task-list coordination (5.6× lift in claim+update events) by ProKil · Pull Request #73 · cooperbench/CooperBench

ProKil · 2026-06-06T05:50:33Z

Summary

Adds a host-side gate to the team harness that turns the shared task list into a hard precondition for agent progress. With this gate active, a weak model (Qwen3.5-9B) goes from emitting coordination events on 10% of pairs to 56% — without distillation from a stronger teacher.

How it works

Three helpers in DefaultAgent.execute_actions intercept bash before env.execute:

_team_required_actions(cmd) — pure decision function. Two rules:
1. unclaimed-own-task: auto-claim any of my tasks at status=open (unless cmd is itself a claim).
2. submit-with-in-progress: when cmd contains the mini_swe submit sentinel and any of my tasks is status=in_progress, auto-update it to status=done.
_team_apply_prefix(cmd) — applies those actions server-side via the host TaskListClient and synthesizes a $ coop-task-claim …\n[auto] claimed: <title> line to prepend to the agent's observation. Server-side application sidesteps the in-container coop-task-* shell CLI, which needs the redis Python module. That module is not always present in task base images, and the install snippet silently no-ops if pip can't reach the index.
_team_blocking_reason(cmd) — returns a refusal string when a rule fires that the harness can't auto-fix. Currently one case: a lead (task title starts with "Lead-only") trying to submit while a peer's task isn't status=done. Lead has to wait.
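
As an illustration of the rules above, here is a minimal sketch of the decision logic. It is not the code from this PR: the Task dataclass, the function signatures, the "coop-task-claim" prefix check, and the SUBMIT_SENTINEL placeholder are all assumptions made so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Placeholder only: the real mini_swe submit sentinel is not shown in this PR.
SUBMIT_SENTINEL = "<<SUBMIT>>"


@dataclass
class Task:
    """Hypothetical task record; the real schema may differ."""
    task_id: str
    title: str
    owner: str
    status: str  # "open", "in_progress", or "done"


def required_actions(cmd: str, me: str, tasks: List[Task]) -> List[Dict[str, str]]:
    """Return the auto-fixable actions to apply before letting cmd through."""
    actions: List[Dict[str, str]] = []
    is_claim = cmd.strip().startswith("coop-task-claim")  # assumed detection
    for task in tasks:
        if task.owner != me:
            continue
        # Rule 1: unclaimed-own-task (skipped when cmd is itself a claim).
        if task.status == "open" and not is_claim:
            actions.append({"kind": "claim", "task_id": task.task_id})
        # Rule 2: submit-with-in-progress.
        elif task.status == "in_progress" and SUBMIT_SENTINEL in cmd:
            actions.append(
                {"kind": "update", "task_id": task.task_id, "status": "done"}
            )
    return actions


def blocking_reason(cmd: str, me: str, tasks: List[Task]) -> Optional[str]:
    """Return a refusal string when a lead submits before peers are done."""
    if SUBMIT_SENTINEL not in cmd:
        return None
    is_lead = any(
        t.owner == me and t.title.startswith("Lead-only") for t in tasks
    )
    if not is_lead:
        return None
    pending = [t.task_id for t in tasks if t.owner != me and t.status != "done"]
    if pending:
        return "submit refused: peer tasks not done: " + ", ".join(pending)
    return None
```

The sketch evaluates both rules against a single snapshot of the task list, so a task that is still open at submit time is only claimed on that pass. Whether the real gate re-evaluates after applying the claim is not stated in the description.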

scripts/convert_team_to_coop.py also replaces the misleading P2P {from, to} shape in conversation.json with a broadcast {sender, sender_role, owner} shape. Task-list events aren't directed messages — they're broadcasts on a shared log. bench-runner is sender_role: system, not a peer; create events carry owner (the pre-assigned assignee); claim/update events have no recipient and no more self-loops.
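
The shape change can be sketched roughly as follows. This is a hedged illustration, not the contents of scripts/convert_team_to_coop.py: the "type" field, the to_broadcast name, and the "agent" role value are assumptions; only sender, sender_role, owner, and the system role for bench-runner come from the description above.

```python
from typing import Any, Dict


def to_broadcast(event: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one legacy {from, to} event into the broadcast shape."""
    # Carry over every field except the directed-message pair.
    out = {k: v for k, v in event.items() if k not in ("from", "to")}
    sender = event["from"]
    out["sender"] = sender
    # bench-runner is the system, not a peer; "agent" is an assumed label.
    out["sender_role"] = "system" if sender == "bench-runner" else "agent"
    # Only create events keep the old recipient, renamed to owner.
    if event.get("type") == "create":
        out["owner"] = event.get("to")
    return out


legacy_update = {"from": "agent1", "to": "agent1", "type": "update", "status": "done"}
print(to_broadcast(legacy_update))
```

Under these assumptions an update event that used to render as a self-loop (from and to both agent1) comes out with a sender and no recipient.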

Validation

Qwen3.5-9B + --setting team --team-no-protocol + CooperData v1 (345 has_conflict pairs across 26 repos):

metric	unforced	forced	lift
pairs with ≥1 claim	100 (29.0%)	343 (99.7%)	3.4×
pairs with ≥1 update	49 (14.2%)	194 (56.4%)	4.0×
pairs with both events	35 (10.1%)	194 (56.4%)	5.6×
both-agent test pass	21 (6.1%)	23 (6.7%)	≈ unchanged
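
The lift column appears to be the forced rate divided by the unforced rate, rounded to one decimal. The short check below reproduces it from the percentages reported in the table; the lift helper is a name introduced here for illustration.

```python
def lift(unforced_pct: float, forced_pct: float) -> float:
    """Ratio of forced to unforced rate, rounded to one decimal place."""
    return round(forced_pct / unforced_pct, 1)


# (unforced %, forced %) pairs copied from the table above.
rows = {
    "pairs with >=1 claim": (29.0, 99.7),
    "pairs with >=1 update": (14.2, 56.4),
    "pairs with both events": (10.1, 56.4),
    "both-agent test pass": (6.1, 6.7),
}

for metric, (unforced, forced) in rows.items():
    print(f"{metric}: {lift(unforced, forced)}x")
```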

Both runs are published side-by-side on HF at CooperBench/team-coop, each with README.md + per-pair index.csv (token counts + coordination-signal columns). Report PR at cooperbench/CooperData#109.

The gate neither helps nor hurts task completion; it only adds coordination structure to trajectories that would otherwise be silent. That coordination signal is the part that matters for downstream training that wants to teach "communicate via the task list."

Test plan

28 new unit tests in tests/agents/mini_swe_agent_v2/test_team_coord_gate.py (fakeredis, no docker). Covers all three rules + no-team-mode fallthrough + idempotence + broken-redis tolerance.
Full suite: 413 passed, 63 skipped.
ruff check, ruff format --check, mypy clean.
End-to-end validation on the 345-pair sweep above; trajectories published.
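
For readers unfamiliar with the idempotence and broken-redis cases, the following is a rough sketch of what such tests might look like. It does not reproduce test_team_coord_gate.py: FlakyStore and auto_claim are hypothetical stand-ins, and a plain in-memory object replaces fakeredis so the example has no dependencies.

```python
from typing import Dict, List


class FlakyStore:
    """In-memory stand-in for the task list; raises when broken is set."""

    def __init__(self, statuses: Dict[str, str], broken: bool = False) -> None:
        self.statuses = dict(statuses)
        self.broken = broken

    def open_tasks(self) -> List[str]:
        if self.broken:
            raise ConnectionError("task list unreachable")
        return [tid for tid, status in self.statuses.items() if status == "open"]

    def claim(self, task_id: str) -> None:
        self.statuses[task_id] = "in_progress"


def auto_claim(store: FlakyStore) -> List[str]:
    """Claim every open task; return claimed ids, or [] if the store fails."""
    try:
        claimed = store.open_tasks()
        for task_id in claimed:
            store.claim(task_id)
        return claimed
    except ConnectionError:
        # Tolerate a broken backend: let the agent's command fall through.
        return []


def test_idempotent() -> None:
    store = FlakyStore({"t1": "open"})
    assert auto_claim(store) == ["t1"]
    assert auto_claim(store) == []  # second pass finds nothing left to claim


def test_broken_store_falls_through() -> None:
    store = FlakyStore({"t1": "open"}, broken=True)
    assert auto_claim(store) == []
    assert store.statuses == {"t1": "open"}  # no partial state change
```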

🤖 Generated with Claude Code

Adds a host-side gate that turns the team harness's task list into a hard precondition for agent progress.

Three helpers in DefaultAgent.execute_actions intercept bash before env.execute:

_team_required_actions(cmd) → list of {kind, task_id, ...} the harness needs to apply before letting cmd through. Two rules:
1. unclaimed-own-task: auto-claim any of my tasks at status=open (unless cmd is itself a claim).
2. submit-with-in-progress: when cmd contains the mini_swe submit sentinel and any of my tasks is status=in_progress, auto-update it to status=done.
_team_apply_prefix(cmd) → applies those actions server-side via the host TaskListClient and returns a synthesized "$ coop-task-claim …" output line to prepend to the agent's observation. Server-side application sidesteps the in-container coop-task-* shell CLI, which depends on the redis Python module. That module is not always present in task base images, and the install snippet silently no-ops if pip can't reach it.
_team_blocking_reason(cmd) → returns a refusal string when a rule fires that the harness cannot auto-fix. Currently just one case: a lead (task title starts with "Lead-only") trying to submit while a peer's task is not yet status=done. The lead has to wait; no auto-action is available.

scripts/convert_team_to_coop.py: replaces the legacy P2P {from, to} shape in conversation.json with broadcast (sender, sender_role, owner). Task-list events are broadcasts on a shared log, not directed messages:
- bench-runner is sender_role: system, not a peer.
- create events carry owner (the pre-assigned assignee), not to.
- claim/update events have no recipient (broadcast); update events no longer have to be rendered as self-loops.

Validation on Qwen3.5-9B + team-no-protocol + CooperData v1 (345 pairs):
pairs with ≥1 claim: 100 (29%) → 343 (99.7%), 3.4×
pairs with ≥1 update: 49 (14%) → 194 (56.4%), 4.0×
pairs with both: 35 (10%) → 194 (56.4%), 5.6×
both-agent test pass: 21 (6.1%) → 23 (6.7%), unchanged

Both runs live at CooperBench/team-coop on HF (baseline + forced subdirs); report PR at cooperbench/CooperData#109.

Tests: 28 unit tests in tests/agents/mini_swe_agent_v2/test_team_coord_gate.py covering the gate logic via fakeredis. Full suite: 413 passed, 63 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ProKil wants to merge 1 commit into main from feat/team-coord-forcing-gate
