feat(olmocr): add chandra-ocr-2 runner #3

Open
Khurdhula-Harshavardhan wants to merge 1 commit into main from feat/olmocr-chandra-ocr-2

@Khurdhula-Harshavardhan
Contributor

Summary

  • Adds olmocr_bench_chandra_ocr2.py + bench/runners/run_chandra_ocr2.py, following the same openai_mini/grok/reducto template
  • Dispatches each rendered PDF page to a self-hosted datalab-to/chandra-ocr-2 vLLM endpoint (HTTPS POST /ocr), auth via x-api-admin-key header
  • Keeps the comparison apples-to-apples with the other candidates: temperature=0, task="ocr_layout", and no reasoning/thinking mode (Chandra OCR 2 is a fine-tuned OCR VLM and has none)
  • Harness supports --sample, --skip-generation, --generate-only, and --limit N (for smoke tests)
  • README updated under the olmOCR section to list the runner + the two required env vars

Required env

CHANDRA_MODAL_URL=https://<workspace>--mlt-chandra-ocr-chandraocr-api.modal.run
CHANDRA_MODAL_ADMIN_KEY=<admin-key>
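
As a rough illustration of how these two variables feed the request described in the summary, here is a minimal sketch that builds (but does not send) the POST /ocr call. The `x-api-admin-key` header, `task="ocr_layout"`, and `temperature=0.0` come from this PR. The function names, the JSON body shape, and the `image_b64` field name are assumptions made for illustration and may not match the runner.

```python
import base64
import json
import os
import urllib.request


def build_ocr_request(page_png: bytes, base_url: str, admin_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /ocr request for one rendered page."""
    payload = {
        # Hypothetical field name: the real endpoint may expect the image differently.
        "image_b64": base64.b64encode(page_png).decode("ascii"),
        "task": "ocr_layout",   # value stated in the PR description
        "temperature": 0.0,     # value stated in the PR description
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ocr",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-admin-key": admin_key,
        },
        method="POST",
    )


def request_from_env(page_png: bytes) -> urllib.request.Request:
    # Raises KeyError early if either required variable is missing.
    return build_ocr_request(
        page_png,
        os.environ["CHANDRA_MODAL_URL"],
        os.environ["CHANDRA_MODAL_ADMIN_KEY"],
    )
```

Reading the variables through `os.environ[...]` rather than `os.environ.get(...)` makes a missing key fail before any page is dispatched.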

Notes

  • RATE_LIMIT=50 matches the throughput of a 10-H100 deployment; lower it when running against a single warm container
  • Existing-file checkpointing is inherited from the openai_mini template: re-running skips pages whose output already exists and resumes from the last completed page
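
The two notes above can be sketched roughly as follows. This is not the runner's code: the `.md` output naming, the helper names, and the reading of `RATE_LIMIT` as a cap on concurrent requests are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable

RATE_LIMIT = 50  # value from the PR; treated here as a concurrency cap (assumption)


def pending_pages(page_ids: Iterable[str], out_dir: Path) -> list[str]:
    """Return the page ids that do not yet have a non-empty output file."""
    todo = []
    for pid in page_ids:
        out = out_dir / f"{pid}.md"  # hypothetical naming scheme
        if not (out.exists() and out.stat().st_size > 0):
            todo.append(pid)
    return todo


def run(
    page_ids: Iterable[str],
    out_dir: Path,
    ocr_page: Callable[[str], str],
    workers: int = RATE_LIMIT,
) -> int:
    """OCR only the pages that are still missing; return how many were processed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    todo = pending_pages(page_ids, out_dir)

    def work(pid: str) -> None:
        text = ocr_page(pid)
        tmp = out_dir / f"{pid}.md.tmp"
        tmp.write_text(text, encoding="utf-8")
        # Rename into place so a crash mid-write does not leave a partial final file.
        tmp.replace(out_dir / f"{pid}.md")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, todo))  # list() re-raises any worker exception
    return len(todo)
```

Under this sketch, running against a single warm container would mean passing a smaller `workers` value than the default.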

Test plan

  • Both files parse via python -m py_compile
  • Full bench run (1,403 pages, 8,413 assertions) — 84.3% ± 0.9% overall (within 1.6 pts of Datalab's headline 85.9%)
  • Existing-file skip verified across a crashed-run resume
  • CI: none configured for this repo yet

Mirrors the existing per-provider olmOCR runners (openai_mini, grok, etc.)
but dispatches each rendered PDF page to a self-hosted `datalab-to/chandra-ocr-2`
vLLM endpoint over HTTPS. Auth via the `x-api-admin-key` header; runner reads
`CHANDRA_MODAL_URL` and `CHANDRA_MODAL_ADMIN_KEY` from env. Request payload
uses `task="ocr_layout"` and `temperature=0.0` to stay apples-to-apples with
the other candidates (no reasoning/thinking mode — Chandra OCR 2 is a
fine-tuned OCR VLM with none).

Harness supports `--sample`, `--skip-generation`, `--generate-only`, and
`--limit N` (smoke). `RATE_LIMIT=50` matches the throughput of a 10-H100
deployment; turn down if running against a single warm container.
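
For orientation, the four flags could be wired up along these lines with `argparse`. Only the flag names come from this PR. The help texts, the boolean type of `--sample`, and the choice to make `--skip-generation` and `--generate-only` mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="chandra-ocr-2 olmOCR bench harness (illustrative sketch)"
    )
    # Assumed to be a simple on/off switch; the real flag may take a value.
    parser.add_argument("--sample", action="store_true",
                        help="run on a sample of the benchmark (assumed meaning)")
    # Assumed to be mutually exclusive: one skips OCR, the other skips scoring.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--skip-generation", action="store_true",
                      help="reuse existing outputs and only evaluate (assumed meaning)")
    mode.add_argument("--generate-only", action="store_true",
                      help="produce outputs but do not evaluate (assumed meaning)")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="process at most N pages, for smoke tests")
    return parser
```

With this parser, `build_parser().parse_args(["--generate-only", "--limit", "5"])` yields `generate_only=True` and `limit=5`, and passing both mode flags together is rejected.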

README updated under the olmOCR section to list the new runner and the two
required env vars.