openhands_sdk: local docker backend + patch.txt submission #69

Open: akhatua2 wants to merge 1 commit into main from feat/openhands-docker-backend

akhatua2 (Collaborator) commented May 29, 2026

Summary

  • Local Docker backend for openhands_sdk, removing the hard dependency on Modal. The new DockerSandboxContext runs the agent-server image locally, using the shared cooperbench Docker network for git and host.docker.internal for Redis. The Modal path stays intact.
  • Aligned submission flow with mini_swe_agent_v2 — agent writes its diff to patch.txt, harness reads it. Drops the old git diff <base_commit> extraction.
  • Eval platform fallback — benchmark base images are a mix of amd64-only and arm64-only; eval/backends/docker.py now tries native first and falls back to linux/amd64 so Apple Silicon hosts work for either.

Test plan

  • CI green (ruff / ruff format / mypy / pytest — 385 pass, 63 skip)
  • End-to-end full subset run: flash subset, 50 pairs, --setting coop --git -c 10 -a openhands_sdk -m gemini/gemini-3-flash-preview --backend docker. Result: 50/50 Submitted, 9 both_passed (18%), 81 min wall, $33.22 total. Zero pipeline / docker / eval errors. Six pairs hit the upstream OpenHands SDK stuck-detection (known intermittent — also surfaces under Modal).
  • Smoke tests on llama_index_task/18813 [1,2]: produced unified-diff patches at the canonical pair locations, plus an eval.json with apply_status: applied and a clean naive merge.
  • Reviewer to check the new DockerSandboxContext cleanup path: it uses docker rm -f directly (skipping the graceful stop that timed out at high concurrency).

Adds an end-to-end docker path to the openhands_sdk adapter so the full
coop+git benchmark can run on a developer laptop without any Modal
dependency. Aligns the submission flow with mini_swe_agent_v2 (agent
writes its unified diff to patch.txt; the harness cats that file).
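
As a rough illustration of the patch.txt flow described above, the harness-side read could look like the sketch below. The names `read_submitted_patch` and `run_in_sandbox` are hypothetical stand-ins for however the adapter actually executes commands in the sandbox; they are not taken from this PR.

```python
import shlex
from typing import Callable

# Hypothetical interface: runs a shell command inside the sandbox and
# returns (exit_code, stdout). The real adapter's interface may differ.
RunFn = Callable[[str], tuple[int, str]]


def read_submitted_patch(run_in_sandbox: RunFn, path: str = "patch.txt") -> str:
    """Return the agent's diff from `path`, or "" if it is missing or empty."""
    exit_code, stdout = run_in_sandbox(f"cat {shlex.quote(path)}")
    if exit_code != 0:
        # File not written: treat as an empty submission rather than an error.
        return ""
    patch = stdout.strip("\n")
    # Normalise to exactly one trailing newline so the diff applies cleanly.
    return patch + "\n" if patch else ""
```

Compared with computing `git diff <base_commit>` in the harness, this keeps the harness free of any repository state: whatever the agent wrote is what gets evaluated.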

## Adapter changes (src/cooperbench/agents/openhands_agent_sdk/adapter.py)

- New `DockerSandboxContext` mirroring `ModalSandboxContext`. Runs the
  agent-server `-oh` image via `docker run -d --rm --platform linux/amd64
  -p 0:8000`, joins the shared `cooperbench` network when git is enabled,
  passes credentials via `--env-file`, polls `/health`, returns
  `http://localhost:<host_port>`. Cleans up via `docker rm -f` (skipping
  the graceful stop that timed out under concurrent load).
- `_collect_sandbox_credentials()` and `_wait_for_agent_server()`
  factored to module scope so both Modal and Docker contexts reuse them.
  Credentials are filtered through `rewrite_comm_url_for_container()` so
  the in-container `REDIS_URL` resolves to `host.docker.internal`.
- Backend branching in `run()`: read `config["backend"]` (default
  `"docker"`), skip Modal redis/git creation when on docker, pick
  `DockerSandboxContext` vs `ModalSandboxContext`.
- Submission flow now matches mini_swe_agent_v2: the prompt instructs
  the agent to write its diff to `patch.txt`, and `_extract_patch` / the
  `base_commit` capture were deleted in favor of a simple `cat patch.txt`
  after the agent finishes. New `_submission_instructions(is_coop)` helper.
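
The sandbox startup described in the first bullet can be sketched as two small pieces: assembling the `docker run` argv and polling `/health` until the server answers. This is a minimal sketch under assumptions, not the PR's code: `build_run_command`, `http_ok`, and `wait_for_health` are hypothetical names, and the timeout values are illustrative.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def build_run_command(image: str, env_file: str, network: Optional[str] = None) -> list[str]:
    """Assemble the `docker run` argv using the flags listed in the PR text."""
    cmd = [
        "docker", "run", "-d", "--rm",
        "--platform", "linux/amd64",
        "-p", "0:8000",              # host port 0, as in the PR, so the host port is not fixed
        "--env-file", env_file,      # credentials are passed via a file, not argv
    ]
    if network:                      # e.g. "cooperbench", joined only when git is enabled
        cmd += ["--network", network]
    cmd.append(image)
    return cmd


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-2xx reply: not healthy yet.
        return False


def wait_for_health(
    base_url: str,
    probe: Callable[[str], bool] = http_ok,
    timeout_s: float = 60.0,         # illustrative default, not from the PR
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(f"{base_url}/health"):
            return True
        sleep(interval_s)
    return False
```

Keeping the probe and sleep injectable lets the polling loop be unit-tested without starting a container.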

## Runner change (src/cooperbench/runner/coop.py)

- Relaxed the `openhands_sdk` git-server-skip from "always" to
  "modal-only". On docker, the adapter now uses the shared
  `DockerGitServer` like every other adapter.
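
The relaxed condition amounts to a two-input predicate. The sketch below is only an illustration of that rule; `resolve_backend` and `skips_shared_git_server` are hypothetical names, and the actual check in `coop.py` may be structured differently.

```python
def resolve_backend(config: dict) -> str:
    """Read the backend from config, defaulting to "docker" as the PR describes."""
    return config.get("backend", "docker")


def skips_shared_git_server(agent: str, backend: str) -> bool:
    """True when the runner should not start the shared git server.

    Before this PR the skip applied to openhands_sdk on every backend;
    now it applies only when that adapter runs on Modal.
    """
    return agent == "openhands_sdk" and backend == "modal"
```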

## Eval change (src/cooperbench/eval/backends/docker.py)

- Eval `containers.run` tries the host's native platform first; on
  "no matching manifest" / platform-mismatch it retries with
  `platform=linux/amd64`. Benchmark base images are a mix of
  arm64-native and amd64-only, so neither pin alone is correct on
  Apple Silicon.
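
The native-first, amd64-fallback behaviour can be sketched as a thin wrapper around whatever callable starts the container. This is a minimal sketch under assumptions: `run_with_platform_fallback` is a hypothetical name, and the error substrings used to detect a platform mismatch are guesses at typical Docker messages, not the exact strings the PR matches on.

```python
from typing import Any, Callable

FALLBACK_PLATFORM = "linux/amd64"

# Assumed substrings signalling a platform problem; the real check may differ.
_MISMATCH_MARKERS = ("no matching manifest", "does not match")


def run_with_platform_fallback(run: Callable[..., Any], image: str, **kwargs: Any) -> Any:
    """Call `run(image, **kwargs)` natively, retrying once pinned to linux/amd64."""
    try:
        return run(image, **kwargs)          # first attempt: host's native platform
    except Exception as exc:
        message = str(exc).lower()
        if not any(marker in message for marker in _MISMATCH_MARKERS):
            raise                            # unrelated failure: do not mask it
    # Second attempt: pin the platform so amd64-only images run under emulation.
    return run(image, platform=FALLBACK_PLATFORM, **kwargs)
```

With the Docker SDK for Python, the wrapper would be called roughly as `run_with_platform_fallback(client.containers.run, image, detach=True)`, since `containers.run` accepts a `platform` keyword. Errors that do not look like a platform mismatch are re-raised unchanged so that genuine daemon or image failures stay visible.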

## Validation

End-to-end validated by a full 50-pair `flash`-subset coop+git run on
docker backend (`gemini/gemini-3-flash-preview`): 50/50 pairs Submitted,
9 both_passed (18%), 81 min wall time, $33.22 total. Zero pipeline
errors; six pairs hit the upstream OpenHands SDK stuck-detection
(known intermittent, surfaces in Modal runs too).

CI:
- `uv run ruff check src/cooperbench/` — clean
- `uv run ruff format --check src/cooperbench/` — clean
- `uv run python -m mypy src/cooperbench/` — clean
- `uv run python -m pytest tests/ -v` — 385 passed, 63 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>