openhands_sdk: local docker backend + patch.txt submission #69
Open
akhatua2 wants to merge 1 commit into
Adds an end-to-end docker path to the openhands_sdk adapter so the full coop+git benchmark can run on a developer laptop without any Modal dependency. Aligns the submission flow with mini_swe_agent_v2 (agent writes its unified diff to patch.txt; the harness cats that file).

## Adapter changes (src/cooperbench/agents/openhands_agent_sdk/adapter.py)

- New `DockerSandboxContext` mirroring `ModalSandboxContext`. Runs the agent-server `-oh` image via `docker run -d --rm --platform linux/amd64 -p 0:8000`, joins the shared `cooperbench` network when git is enabled, passes credentials via `--env-file`, polls `/health`, returns `http://localhost:<host_port>`. Cleans up via `docker rm -f` (skipping the graceful stop that timed out under concurrent load).
- `_collect_sandbox_credentials()` and `_wait_for_agent_server()` factored to module scope so both Modal and Docker contexts reuse them. Credentials are filtered through `rewrite_comm_url_for_container()` so the in-container `REDIS_URL` resolves to `host.docker.internal`.
- Backend branching in `run()`: read `config["backend"]` (default `"docker"`), skip Modal redis/git creation when on docker, pick `DockerSandboxContext` vs `ModalSandboxContext`.
- Submission flow now matches mini_swe_agent_v2: the prompt instructs the agent to write its diff to `patch.txt`, and `_extract_patch` / the `base_commit` capture were deleted in favor of a simple `cat patch.txt` after the agent finishes. New `_submission_instructions(is_coop)` helper.

## Runner change (src/cooperbench/runner/coop.py)

- Relaxed the `openhands_sdk` git-server-skip from "always" to "modal-only". On docker, the adapter now uses the shared `DockerGitServer` like every other adapter.

## Eval change (src/cooperbench/eval/backends/docker.py)

- Eval `containers.run` tries the host's native platform first; on "no matching manifest" / platform-mismatch it retries with `platform=linux/amd64`. Benchmark base images are a mix of arm64-native and amd64-only, so neither pin alone is correct on Apple Silicon.

## Validation

End-to-end validated by a full 50-pair `flash`-subset coop+git run on docker backend (`gemini/gemini-3-flash-preview`): 50/50 pairs Submitted, 9 both_passed (18%), 81 min wall time, $33.22 total. Zero pipeline errors; six pairs hit the upstream OpenHands SDK stuck-detection (known intermittent, surfaces in Modal runs too).

CI:

- `uv run ruff check src/cooperbench/`: clean
- `uv run ruff format --check src/cooperbench/`: clean
- `uv run python -m mypy src/cooperbench/`: clean
- `uv run python -m pytest tests/ -v`: 385 passed, 63 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
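
For reviewers who want a picture of the adapter changes, here is a minimal sketch of how the `docker run` arguments and the in-container `REDIS_URL` rewrite could be put together. It is not the code in `adapter.py`: the function names `rewrite_localhost_url` and `build_docker_run_argv` and their parameters are hypothetical, and joining the shared network through `--network cooperbench` is an assumption. Only the flags quoted in the description above come from this PR.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostnames that only make sense on the host itself (assumed set for this sketch).
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def rewrite_localhost_url(url: str, container_host: str = "host.docker.internal") -> str:
    """Point a host-local URL at the Docker host as seen from inside a container."""
    parts = urlsplit(url)
    if parts.hostname not in LOCAL_HOSTS:
        return url  # not a host-local address; leave it untouched
    # Keep any "user:password@" prefix exactly as written.
    userinfo, at, _hostport = parts.netloc.rpartition("@")
    hostport = container_host if parts.port is None else f"{container_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=f"{userinfo}{at}{hostport}"))


def build_docker_run_argv(image: str, env_file: str, use_git_network: bool) -> list[str]:
    """Assemble a detached `docker run` command with an ephemeral host port."""
    argv = [
        "docker", "run", "-d", "--rm",
        "--platform", "linux/amd64",
        "-p", "0:8000",            # host port 0: let Docker pick a free port
        "--env-file", env_file,    # credentials are passed via a file, not argv
    ]
    if use_git_network:
        # Assumption: the shared network is joined with the --network flag.
        argv += ["--network", "cooperbench"]
    argv.append(image)
    return argv


if __name__ == "__main__":
    print(rewrite_localhost_url("redis://localhost:6379/0"))
    print(" ".join(build_docker_run_argv("agent-server:example", "/tmp/creds.env", True)))
```

Passing credentials through `--env-file` rather than `-e` keeps secrets out of the process list, and publishing to host port `0` avoids port collisions when several sandboxes start concurrently.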
## Summary

- Local docker backend for `openhands_sdk`: no more Modal dependency. New `DockerSandboxContext` runs the agent-server image locally, with the `cooperbench` docker network for git and `host.docker.internal` for redis. Modal path stays intact.
- Submission flow aligned with `mini_swe_agent_v2`: agent writes its diff to `patch.txt`, harness reads it. Drops the old `git diff <base_commit>` extraction.
- `eval/backends/docker.py` now tries the native platform first and falls back to `linux/amd64`, so Apple Silicon hosts work with both arm64-native and amd64-only images.

## Test plan

- CI: `ruff` / `ruff format` / `mypy` / `pytest` (385 pass, 63 skip).
- `flash` subset, 50 pairs, `--setting coop --git -c 10 -a openhands_sdk -m gemini/gemini-3-flash-preview --backend docker`. Result: 50/50 Submitted, 9 both_passed (18%), 81 min wall, $33.22 total. Zero pipeline / docker / eval errors. Six pairs hit the upstream OpenHands SDK stuck-detection (known intermittent, also surfaces under Modal).
- `llama_index_task/18813 [1,2]`: produced unified-diff patches at canonical pair locations, eval.json with `apply_status: applied` and clean naive merge.
- `DockerSandboxContext` cleanup path: it uses `docker rm -f` directly (skipping the graceful stop that timed out at high concurrency).
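
The native-first platform fallback mentioned in the summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `eval/backends/docker.py`: `run_with_platform_fallback` and `PLATFORM_ERROR_MARKERS` are hypothetical names, and the second marker string is a guess at what a platform-mismatch message might contain.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Substrings treated as "wrong platform" errors. The first is quoted in this PR;
# the second is an assumed example of a platform-mismatch message.
PLATFORM_ERROR_MARKERS = (
    "no matching manifest",
    "does not match the detected host platform",
)


def run_with_platform_fallback(
    run: Callable[[Optional[str]], T],
    fallback: str = "linux/amd64",
) -> T:
    """Call run(None) for the native platform; retry once with `fallback` on mismatch."""
    try:
        return run(None)
    except Exception as exc:
        message = str(exc).lower()
        if not any(marker in message for marker in PLATFORM_ERROR_MARKERS):
            raise  # unrelated failure: do not mask it with a retry
        return run(fallback)


if __name__ == "__main__":
    def fake_run(platform: Optional[str]) -> str:
        # Stand-in for a container start; fails natively, succeeds under emulation.
        if platform is None:
            raise RuntimeError("no matching manifest for linux/arm64/v8")
        return f"started on {platform}"

    print(run_with_platform_fallback(fake_run))
```

With the Docker SDK for Python, `run` could be a small closure around `client.containers.run(image, command, platform=platform)`. Retrying only on recognised platform errors keeps genuine failures (missing image, bad command) visible instead of silently re-running them under emulation.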