openhands_sdk: local docker backend + patch.txt submission by akhatua2 · Pull Request #69 · cooperbench/CooperBench

akhatua2 · 2026-05-29T08:41:13Z

Summary

Local Docker backend for openhands_sdk — no more Modal dependency. New DockerSandboxContext runs the agent-server image locally with the cooperbench docker network for git, host.docker.internal for redis. Modal path stays intact.
Aligned submission flow with mini_swe_agent_v2 — agent writes its diff to patch.txt, harness reads it. Drops the old git diff <base_commit> extraction.
Eval platform fallback — benchmark base images are a mix of amd64-only and arm64-only; eval/backends/docker.py now tries native first and falls back to linux/amd64 so Apple Silicon hosts work for either.
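
The native-first, amd64-fallback behaviour in the last item can be sketched as below. This is a minimal illustration, not the code in eval/backends/docker.py: `run_with_platform_fallback`, the injected `run` callable, and the `PLATFORM_ERROR_HINTS` strings are all hypothetical names and values chosen for the example.

```python
from typing import Any, Callable

# Substrings treated as a platform mismatch. Assumed for illustration only;
# the real error text depends on the Docker daemon and SDK version.
PLATFORM_ERROR_HINTS = (
    "no matching manifest",
    "does not match the detected host platform",
)


def run_with_platform_fallback(
    run: Callable[..., Any],
    image: str,
    fallback_platform: str = "linux/amd64",
    **kwargs: Any,
) -> Any:
    """Try the host's native platform first, then retry once with a pinned one."""
    try:
        # No platform argument: the daemon picks the host's native platform.
        return run(image, **kwargs)
    except Exception as exc:
        message = str(exc).lower()
        if not any(hint in message for hint in PLATFORM_ERROR_HINTS):
            # Unrelated failure: re-raise instead of masking it with a retry.
            raise
        # Platform mismatch: retry once with the pinned fallback platform.
        return run(image, platform=fallback_platform, **kwargs)
```

With the Docker SDK for Python, `run` could plausibly be `client.containers.run`, which accepts a `platform` keyword. Passing the callable in keeps the retry logic testable without a running daemon.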

Test plan

CI green (ruff / ruff format / mypy / pytest — 385 pass, 63 skip)
End-to-end full subset run: flash subset, 50 pairs, --setting coop --git -c 10 -a openhands_sdk -m gemini/gemini-3-flash-preview --backend docker. Result: 50/50 Submitted, 9 both_passed (18%), 81 min wall, $33.22 total. Zero pipeline / docker / eval errors. Six pairs hit the upstream OpenHands SDK stuck-detection (known intermittent — also surfaces under Modal).
Smoke tests on llama_index_task/18813 [1,2]: produced unified-diff patches at canonical pair locations, eval.json with apply_status: applied and clean naive merge.
Reviewer to check the new DockerSandboxContext cleanup path: it uses docker rm -f directly (skipping the graceful stop that timed out at high concurrency).
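
For reviewers looking at the cleanup path, a rough sketch of a forced removal follows. It assumes cleanup shells out to the docker CLI; `force_remove_command`, `cleanup_container`, the injected `runner`, and the 30-second timeout are hypothetical and are not taken from `DockerSandboxContext`.

```python
import subprocess
from typing import Any, Callable


def force_remove_command(container_id: str) -> list[str]:
    """Build the CLI call that kills and removes a container in one step."""
    return ["docker", "rm", "-f", container_id]


def cleanup_container(
    container_id: str,
    runner: Callable[..., Any] = subprocess.run,
    timeout: float = 30.0,  # assumed value, not from the PR
) -> bool:
    """Force-remove a container; return True only if the CLI exited with 0."""
    try:
        result = runner(
            force_remove_command(container_id),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # inspect the return code instead of raising
        )
    except (subprocess.TimeoutExpired, OSError):
        # CLI missing or hung: report failure rather than crash teardown.
        return False
    return result.returncode == 0
```

`docker rm -f` kills a running container with SIGKILL instead of waiting out a stop grace period, so the agent-server process gets no chance to shut down cleanly. That trade-off is the part worth checking in review.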

Adds an end-to-end docker path to the openhands_sdk adapter so the full coop+git benchmark can run on a developer laptop without any Modal dependency. Aligns the submission flow with mini_swe_agent_v2 (agent writes its unified diff to patch.txt; the harness cats that file).

## Adapter changes (src/cooperbench/agents/openhands_agent_sdk/adapter.py)

- New `DockerSandboxContext` mirroring `ModalSandboxContext`. Runs the agent-server `-oh` image via `docker run -d --rm --platform linux/amd64 -p 0:8000`, joins the shared `cooperbench` network when git is enabled, passes credentials via `--env-file`, polls `/health`, returns `http://localhost:<host_port>`. Cleans up via `docker rm -f` (skipping the graceful stop that timed out under concurrent load).
- `_collect_sandbox_credentials()` and `_wait_for_agent_server()` factored to module scope so both Modal and Docker contexts reuse them. Credentials are filtered through `rewrite_comm_url_for_container()` so the in-container `REDIS_URL` resolves to `host.docker.internal`.
- Backend branching in `run()`: read `config["backend"]` (default `"docker"`), skip Modal redis/git creation when on docker, pick `DockerSandboxContext` vs `ModalSandboxContext`.
- Submission flow now matches mini_swe_agent_v2: the prompt instructs the agent to write its diff to `patch.txt`, and `_extract_patch` / the `base_commit` capture were deleted in favor of a simple `cat patch.txt` after the agent finishes. New `_submission_instructions(is_coop)` helper.

## Runner change (src/cooperbench/runner/coop.py)

- Relaxed the `openhands_sdk` git-server-skip from "always" to "modal-only". On docker, the adapter now uses the shared `DockerGitServer` like every other adapter.

## Eval change (src/cooperbench/eval/backends/docker.py)

- Eval `containers.run` tries the host's native platform first; on "no matching manifest" / platform-mismatch it retries with `platform=linux/amd64`. Benchmark base images are a mix of arm64-native and amd64-only, so neither pin alone is correct on Apple Silicon.

## Validation

End-to-end validated by a full 50-pair `flash`-subset coop+git run on docker backend (`gemini/gemini-3-flash-preview`): 50/50 pairs Submitted, 9 both_passed (18%), 81 min wall time, $33.22 total. Zero pipeline errors; six pairs hit the upstream OpenHands SDK stuck-detection (known intermittent, surfaces in Modal runs too).

CI:

- `uv run ruff check src/cooperbench/`: clean
- `uv run ruff format --check src/cooperbench/`: clean
- `uv run python -m mypy src/cooperbench/`: clean
- `uv run python -m pytest tests/ -v`: 385 passed, 63 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
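
As a rough illustration of the credential rewrite described under Adapter changes, the sketch below swaps a loopback host for `host.docker.internal` while keeping the userinfo, port, and path intact. It is not the body of `rewrite_comm_url_for_container()`, which is not shown in this PR description; `rewrite_loopback_url` and `LOOPBACK_HOSTS` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts that mean "this machine" on the host but "this container" inside one.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def rewrite_loopback_url(url: str, container_host: str = "host.docker.internal") -> str:
    """Point a loopback URL at the Docker host; leave any other URL untouched."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url
    # Keep the raw userinfo (for example ":password") exactly as written.
    userinfo, at, _hostport = parts.netloc.rpartition("@")
    port = f":{parts.port}" if parts.port is not None else ""
    netloc = f"{userinfo}{at}{container_host}{port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

A value such as `redis://localhost:6379/0` would become `redis://host.docker.internal:6379/0`, while a URL that already names a remote host passes through unchanged.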

akhatua2 mentioned this pull request May 29, 2026

plan_execute: new two-phase setting (plan then execute) #70

akhatua2 wants to merge 1 commit into main from feat/openhands-docker-backend
