team mode: structural forcing for task-list coordination (5.6× lift in claim+update events) #73

Open: ProKil wants to merge 1 commit into main from feat/team-coord-forcing-gate

ProKil commented Jun 6, 2026

Summary

Adds a host-side gate to the team harness that turns the shared task list into a hard precondition for agent progress. With this gate active, a weak model (Qwen3.5-9B) goes from emitting both claim and update events on 10.1% of pairs to 56.4%, without distillation from a stronger teacher.

How it works

Three helpers in DefaultAgent.execute_actions intercept bash commands before env.execute:

  • _team_required_actions(cmd) — pure decision function. Two rules:
    1. unclaimed-own-task: auto-claim any of my tasks at status=open (unless cmd is itself a claim).
    2. submit-with-in-progress: when cmd contains the mini_swe submit sentinel and any of my tasks is status=in_progress, auto-update it to status=done.
  • _team_apply_prefix(cmd): applies those actions server-side via the host TaskListClient and synthesizes a $ coop-task-claim …\n[auto] claimed: <title> line to prepend to the agent's observation. Server-side application sidesteps the in-container coop-task-* shell CLI, which needs the redis Python module. That module is not always present in task base images, and the install snippet silently no-ops if pip can't reach the index.
  • _team_blocking_reason(cmd) — returns a refusal string when a rule fires that the harness can't auto-fix. Currently one case: a lead (task title starts with "Lead-only") trying to submit while a peer's task isn't status=done. Lead has to wait.
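
The three rules above can be sketched as two pure functions. This is a minimal illustration, not the code in this PR: the `Task` dataclass, the function signatures, the refusal wording, and `SUBMIT_SENTINEL` (a placeholder, since the real mini_swe sentinel is not shown here) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder: the real mini_swe submit sentinel is not shown in this PR text.
SUBMIT_SENTINEL = "HYPOTHETICAL_SUBMIT_SENTINEL"


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; the real task-list schema may differ."""
    task_id: str
    owner: str
    status: str  # "open", "in_progress", or "done"
    title: str = ""


def required_actions(cmd: str, me: str, tasks: list[Task]) -> list[dict]:
    """Rules 1 and 2: actions the harness can apply on the agent's behalf."""
    is_claim = "coop-task-claim" in cmd
    is_submit = SUBMIT_SENTINEL in cmd
    actions = []
    for task in tasks:
        if task.owner != me:
            continue
        if task.status == "open" and not is_claim:
            # Rule 1: unclaimed-own-task
            actions.append({"kind": "claim", "task_id": task.task_id})
        elif task.status == "in_progress" and is_submit:
            # Rule 2: submit-with-in-progress
            actions.append(
                {"kind": "update", "task_id": task.task_id, "status": "done"}
            )
    return actions


def blocking_reason(cmd: str, me: str, tasks: list[Task]) -> Optional[str]:
    """Rule 3: refuse a lead's submit while any peer task is not done."""
    if SUBMIT_SENTINEL not in cmd:
        return None
    is_lead = any(
        t.owner == me and t.title.startswith("Lead-only") for t in tasks
    )
    pending = [t for t in tasks if t.owner != me and t.status != "done"]
    if is_lead and pending:
        ids = ", ".join(t.task_id for t in pending)
        return f"blocked: peer task(s) not done yet: {ids}"
    return None
```

Keeping both functions free of I/O is what makes them testable without docker. How the real gate sequences a claim and an update that fall on the same submit step is not stated in this description, so the sketch does not model it.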

scripts/convert_team_to_coop.py also replaces the misleading P2P {from, to} shape in conversation.json with a broadcast {sender, sender_role, owner} shape. Task-list events aren't directed messages — they're broadcasts on a shared log. bench-runner is sender_role: system, not a peer; create events carry owner (the pre-assigned assignee); claim/update events have no recipient and no more self-loops.
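
A rough sketch of that shape change, under stated assumptions: the `type` field name, the `"agent"` role label for non-system senders, and dropping `owner` on claim/update events are illustrative guesses, not the converter's actual behavior.

```python
def to_broadcast(event: dict) -> dict:
    """Convert a legacy {from, to} event into the broadcast shape."""
    # Carry over every field except the directed-message pair.
    out = {k: v for k, v in event.items() if k not in ("from", "to")}
    sender = event["from"]
    out["sender"] = sender
    # bench-runner is the system, not a peer; "agent" is an assumed label.
    out["sender_role"] = "system" if sender == "bench-runner" else "agent"
    if event.get("type") == "create":
        # The old recipient of a create event was the pre-assigned assignee.
        out["owner"] = event.get("to")
    return out
```

Under this sketch an update that used to be rendered as `{from: agent1, to: agent1}` comes out with no recipient field at all, so the self-loop disappears.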

Validation

Qwen3.5-9B + --setting team --team-no-protocol + CooperData v1 (345 has_conflict pairs across 26 repos):

| metric | unforced | forced | lift |
|---|---|---|---|
| pairs with ≥1 claim | 100 (29.0%) | 343 (99.7%) | 3.4× |
| pairs with ≥1 update | 49 (14.2%) | 194 (56.4%) | 4.0× |
| pairs with both events | 35 (10.1%) | 194 (56.4%) | 5.6× |
| both-agent test pass | 21 (6.1%) | 23 (6.7%) | ≈ unchanged |
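
For readers checking the lift column, the short sketch below assumes lift is the forced percentage divided by the unforced percentage, rounded to one decimal. The variable names are illustrative only.

```python
# (unforced %, forced %) as reported in the table above.
rates = {
    "pairs with >=1 claim": (29.0, 99.7),
    "pairs with >=1 update": (14.2, 56.4),
    "pairs with both events": (10.1, 56.4),
}


def lift(unforced_pct: float, forced_pct: float) -> float:
    """Ratio of forced to unforced rate, rounded to one decimal place."""
    return round(forced_pct / unforced_pct, 1)


lifts = {name: lift(*pair) for name, pair in rates.items()}
```

With these inputs the ratios round to 3.4, 4.0, and 5.6, matching the reported values.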

Both runs are published side-by-side on HF at CooperBench/team-coop, each with README.md + per-pair index.csv (token counts + coordination-signal columns). Report PR at cooperbench/CooperData#109.

The gate neither helps nor hurts task completion; it only adds coordination structure to trajectories that would otherwise be silent. That coordination signal is exactly what downstream training needs when the goal is to teach "communicate via the task list."

Test plan

  • 28 new unit tests in tests/agents/mini_swe_agent_v2/test_team_coord_gate.py (fakeredis, no docker). Covers all three rules + no-team-mode fallthrough + idempotence + broken-redis tolerance.
  • Full suite: 413 passed, 63 skipped.
  • ruff check, ruff format --check, mypy clean.
  • End-to-end validation on the 345-pair sweep above; trajectories published.
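
To illustrate what the idempotence and broken-redis cases check, here is a hedged sketch in the style of those unit tests. It swaps fakeredis for a plain in-memory stand-in, and `InMemoryTaskList`, `apply_prefix`, and the choice to return an empty prefix on failure are hypothetical, not the PR's implementation.

```python
class InMemoryTaskList:
    """Hypothetical stand-in for the host TaskListClient."""

    def __init__(self, tasks: dict, broken: bool = False):
        self.tasks = tasks  # task_id -> {"title": ..., "status": ...}
        self.broken = broken
        self.claims = []

    def claim(self, task_id: str) -> None:
        if self.broken:
            raise ConnectionError("redis unreachable")
        self.tasks[task_id]["status"] = "in_progress"
        self.claims.append(task_id)


def apply_prefix(client: InMemoryTaskList, owned_ids: list) -> str:
    """Auto-claim open tasks and build the text prepended to the observation."""
    lines = []
    for task_id in owned_ids:
        if client.tasks[task_id]["status"] != "open":
            continue  # already claimed or done: nothing to do
        try:
            client.claim(task_id)
        except ConnectionError:
            return ""  # assumed behavior: let the command through unmodified
        title = client.tasks[task_id]["title"]
        lines.append(f"$ coop-task-claim …\n[auto] claimed: {title}")
    return "\n".join(lines)


def test_apply_prefix_is_idempotent():
    client = InMemoryTaskList({"t1": {"title": "impl parser", "status": "open"}})
    first = apply_prefix(client, ["t1"])
    second = apply_prefix(client, ["t1"])
    assert first.endswith("[auto] claimed: impl parser")
    assert second == ""
    assert client.claims == ["t1"]


def test_apply_prefix_tolerates_broken_backend():
    client = InMemoryTaskList(
        {"t1": {"title": "impl parser", "status": "open"}}, broken=True
    )
    assert apply_prefix(client, ["t1"]) == ""
    assert client.tasks["t1"]["status"] == "open"
```

Both functions follow pytest's `test_` naming so they can be collected as-is; the `$ coop-task-claim …` text is kept exactly as elided in the description above.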

🤖 Generated with Claude Code

Adds a host-side gate that turns the team harness's task-list into a hard
precondition for agent progress. Three helpers in DefaultAgent.execute_actions
intercept bash before env.execute:

  _team_required_actions(cmd) → list of {kind, task_id, ...} the harness
    needs to apply before letting cmd through. Two rules:
      1. unclaimed-own-task: auto-claim any of my tasks at status=open
         (unless cmd is itself a claim).
      2. submit-with-in-progress: when cmd contains the mini_swe submit
         sentinel and any of my tasks is status=in_progress, auto-update
         it to status=done.

  _team_apply_prefix(cmd) → applies those actions server-side via the host
    TaskListClient and returns a synthesized "$ coop-task-claim …" output
    line to prepend to the agent's observation. Server-side application
    sidesteps the in-container coop-task-* shell CLI which depends on the
    redis python module — not always present in task base images and the
    install snippet silently no-ops if pip can't reach it.

  _team_blocking_reason(cmd) → returns a refusal string when a rule fires
    that the harness cannot auto-fix. Currently just one case: a lead
    (task title starts with "Lead-only") trying to submit while a peer's
    task is not yet status=done. The lead has to wait; no auto-action
    available.

scripts/convert_team_to_coop.py: replaces the legacy P2P {from, to} shape
in conversation.json with broadcast (sender, sender_role, owner). Task-list
events are broadcasts on a shared log, not directed messages:
  - bench-runner is sender_role: system, not a peer.
  - create events carry owner (the pre-assigned assignee), not to.
  - claim/update events have no recipient (broadcast); update events no
    longer have to be rendered as self-loops.

Validation on Qwen3.5-9B + team-no-protocol + CooperData v1 (345 pairs):
  pairs with ≥1 claim    100 (29%) → 343 (99.7%) — 3.4×
  pairs with ≥1 update    49 (14%) → 194 (56.4%) — 4.0×
  pairs with both        35 (10%) → 194 (56.4%) — 5.6×
  both-agent test pass    21 (6.1%) → 23 (6.7%)  — unchanged

Both runs live at CooperBench/team-coop on HF (baseline + forced subdirs);
report PR at cooperbench/CooperData#109.

Tests: 28 unit tests in tests/agents/mini_swe_agent_v2/test_team_coord_gate.py
covering the gate logic via fakeredis. Full suite: 413 passed, 63 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>