team mode: structural forcing for task-list coordination (5.6× lift in claim+update events) #73
Open
ProKil wants to merge 1 commit into
Adds a host-side gate that turns the team harness's task-list into a hard
precondition for agent progress. Two helpers in DefaultAgent.execute_actions
intercept bash before env.execute:
_team_required_actions(cmd) → list of {kind, task_id, ...} the harness
needs to apply before letting cmd through. Two rules:
1. unclaimed-own-task: auto-claim any of my tasks at status=open
(unless cmd is itself a claim).
2. submit-with-in-progress: when cmd contains the mini_swe submit
sentinel and any of my tasks is status=in_progress, auto-update
it to status=done.
_team_apply_prefix(cmd) → applies those actions server-side via the host
TaskListClient and returns a synthesized "$ coop-task-claim …" output
line to prepend to the agent's observation. Server-side application
sidesteps the in-container coop-task-* shell CLI, which depends on the
redis Python module. That module is not always present in task base
images, and the install snippet silently no-ops if pip can't reach the
package index.
_team_blocking_reason(cmd) → returns a refusal string when a rule fires
that the harness cannot auto-fix. Currently just one case: a lead
(task title starts with "Lead-only") trying to submit while a peer's
task is not yet status=done. The lead has to wait; no auto-action
available.
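
The three helpers described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the task record shape ({"id", "title", "owner", "status"}), the SUBMIT_SENTINEL and CLAIM_COMMAND strings, the refusal wording, and the assumption that a claim moves a task to status=in_progress are all hypothetical. The sketch mutates an in-memory list where the real harness goes through the host TaskListClient, and it keeps the "…" in the synthesized line as-is because the full command form is not given here.

```python
from typing import Dict, List, Optional

# Assumed strings for illustration; the real harness may use different values.
SUBMIT_SENTINEL = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"
CLAIM_COMMAND = "coop-task-claim"

# Hypothetical task record: {"id", "title", "owner", "status"}.
Task = Dict[str, str]


def required_actions(cmd: str, me: str, tasks: List[Task]) -> List[Dict[str, str]]:
    """Pure decision step: which auto-fixes must run before `cmd`?"""
    mine = [t for t in tasks if t["owner"] == me]
    actions: List[Dict[str, str]] = []
    # Rule 1 (unclaimed-own-task): skipped when cmd is itself a claim.
    if CLAIM_COMMAND not in cmd:
        actions += [{"kind": "claim", "task_id": t["id"]}
                    for t in mine if t["status"] == "open"]
    # Rule 2 (submit-with-in-progress): close out in-progress tasks on submit.
    if SUBMIT_SENTINEL in cmd:
        actions += [{"kind": "update", "task_id": t["id"], "status": "done"}
                    for t in mine if t["status"] == "in_progress"]
    return actions


def blocking_reason(cmd: str, me: str, tasks: List[Task]) -> Optional[str]:
    """Return a refusal string for the one case that cannot be auto-fixed."""
    if SUBMIT_SENTINEL not in cmd:
        return None
    is_lead = any(t["owner"] == me and t["title"].startswith("Lead-only")
                  for t in tasks)
    pending = [t["id"] for t in tasks
               if t["owner"] != me and t["status"] != "done"]
    if is_lead and pending:
        return "submit refused: peer tasks not done: " + ", ".join(pending)
    return None


def apply_prefix(cmd: str, me: str, tasks: List[Task]) -> str:
    """Apply the required actions in place; return text to prepend to the observation."""
    by_id = {t["id"]: t for t in tasks}
    lines = []
    for action in required_actions(cmd, me, tasks):
        task = by_id[action["task_id"]]
        if action["kind"] == "claim":
            task["status"] = "in_progress"  # assumed effect of a claim
            lines.append(f"$ coop-task-claim …\n[auto] claimed: {task['title']}")
        else:
            # The output format for auto-updates is not given, so none is emitted.
            task["status"] = action["status"]
    return "\n".join(lines)
```

Because the decision step reads the current task state, calling apply_prefix a second time for the same command returns an empty string in this sketch: the task is no longer open, so rule 1 has nothing left to do.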
scripts/convert_team_to_coop.py: replaces the legacy P2P {from, to} shape
in conversation.json with broadcast (sender, sender_role, owner). Task-list
events are broadcasts on a shared log, not directed messages:
- bench-runner is sender_role: system, not a peer.
- create events carry owner (the pre-assigned assignee), not to.
- claim/update events have no recipient (broadcast); update events no
longer have to be rendered as self-loops.
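
As a rough illustration of the shape change (not the actual contents of scripts/convert_team_to_coop.py), a per-event conversion might look like the sketch below. The "type" and "task_id" field names and the "peer" role value are assumptions made for the example; only from, to, sender, sender_role, owner, and the system role for bench-runner come from the description above.

```python
from typing import Any, Dict


def to_broadcast(event: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one legacy {from, to} event into the broadcast shape."""
    sender = event["from"]
    # Drop the directed-message fields; keep everything else untouched.
    out = {k: v for k, v in event.items() if k not in ("from", "to")}
    out["sender"] = sender
    # bench-runner is the system, not a peer; "peer" is an assumed role name.
    out["sender_role"] = "system" if sender == "bench-runner" else "peer"
    # Only create events keep the old recipient, renamed to owner.
    if event.get("type") == "create":
        out["owner"] = event.get("to")
    return out


legacy_create = {"from": "bench-runner", "to": "agent1", "type": "create", "task_id": "t1"}
legacy_claim = {"from": "agent1", "to": "agent1", "type": "claim", "task_id": "t1"}
print(to_broadcast(legacy_create))
print(to_broadcast(legacy_claim))
```

The claim event comes out with no recipient field at all, which is what removes the self-loop when the log is rendered.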
Validation on Qwen3.5-9B + team-no-protocol + CooperData v1 (345 pairs):
pairs with ≥1 claim 100 (29%) → 343 (99.7%) — 3.4×
pairs with ≥1 update 49 (14%) → 194 (56.4%) — 4.0×
pairs with both 35 (10%) → 194 (56.4%) — 5.6×
both-agent test pass 21 (6.1%) → 23 (6.7%) — unchanged
Both runs live at CooperBench/team-coop on HF (baseline + forced subdirs);
report PR at cooperbench/CooperData#109.
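
One way to reproduce the lift column is to take the ratio of the reported percentages. That the figures were derived this way is an assumption; the sketch only checks that the rounded percentages quoted above are consistent with the quoted multipliers.

```python
def lift(baseline_pct: float, forced_pct: float) -> float:
    """Ratio of the forced rate to the baseline rate."""
    return forced_pct / baseline_pct


# (baseline %, forced %) as quoted in the validation rows above.
rows = {
    "pairs with >=1 claim": (29.0, 99.7),
    "pairs with >=1 update": (14.0, 56.4),
    "pairs with both": (10.0, 56.4),
}

for name, (baseline_pct, forced_pct) in rows.items():
    print(f"{name}: {lift(baseline_pct, forced_pct):.1f}x")
```

Rounded to one decimal, the three ratios come out as 3.4, 4.0 and 5.6, matching the multipliers listed in the rows.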
Tests: 28 unit tests in tests/agents/mini_swe_agent_v2/test_team_coord_gate.py
cover the gate logic via fakeredis. Full suite: 413 passed, 63 skipped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds a host-side gate to the team harness that turns the shared task list into a hard precondition for agent progress. With this gate active, a weak model (Qwen3.5-9B) goes from emitting both claim and update events on 10% of pairs to 56%, without distillation from a stronger teacher.
How it works
Two helpers in DefaultAgent.execute_actions intercept bash before env.execute:
- _team_required_actions(cmd): pure decision function. Two rules:
  1. unclaimed-own-task: auto-claim any of my tasks at status=open (unless cmd is itself a claim).
  2. submit-with-in-progress: when cmd contains the mini_swe submit sentinel and any of my tasks is status=in_progress, auto-update it to status=done.
- _team_apply_prefix(cmd): applies those actions server-side via the host TaskListClient and synthesizes a "$ coop-task-claim …\n[auto] claimed: <title>" line to prepend to the agent's observation. Server-side application sidesteps the in-container coop-task-* shell CLI, which needs the redis Python module. That module is not always present in task base images, and the install snippet silently no-ops if pip can't reach the index.
- _team_blocking_reason(cmd): returns a refusal string when a rule fires that the harness can't auto-fix. Currently one case: a lead (task title starts with "Lead-only") trying to submit while a peer's task isn't status=done. The lead has to wait.

scripts/convert_team_to_coop.py also replaces the misleading P2P {from, to} shape in conversation.json with a broadcast {sender, sender_role, owner} shape. Task-list events aren't directed messages; they're broadcasts on a shared log:
- bench-runner is sender_role: system, not a peer.
- create events carry owner (the pre-assigned assignee).
- claim/update events have no recipient, so there are no more self-loops.
Validation
Qwen3.5-9B + --setting team --team-no-protocol + CooperData v1 (345 has_conflict pairs across 26 repos), baseline → forced:
- pairs with ≥1 claim: 100 (29%) → 343 (99.7%), 3.4×
- pairs with ≥1 update: 49 (14%) → 194 (56.4%), 4.0×
- pairs with both: 35 (10%) → 194 (56.4%), 5.6×
- both-agent test pass: 21 (6.1%) → 23 (6.7%), unchanged

Both runs are published side by side on HF at CooperBench/team-coop, each with README.md + per-pair index.csv (token counts + coordination-signal columns). Report PR at cooperbench/CooperData#109.

The gate doesn't help or hurt task completion; it only adds coordination structure to trajectories that would otherwise be silent. That coordination signal is exactly what matters for downstream training that wants to teach "communicate via the task list."
Test plan
- tests/agents/mini_swe_agent_v2/test_team_coord_gate.py (fakeredis, no docker). Covers all three rules + no-team-mode fallthrough + idempotence + broken-redis tolerance.
- ruff check, ruff format --check, mypy clean.

🤖 Generated with Claude Code