Save arena battles and ELO ratings as artifacts by kargibora · Pull Request #63 · OpenEuroLLM/JudgeArena

kargibora · 2026-06-10T13:13:14Z

Problem

The arena ELO pipeline (estimate_elo_ratings.main) computed Bradley-Terry ratings but saved nothing usable.
The ratings were only printed and returned (and the CLI discards the return value), while the per-battle data lived only in cache/elo/*.csv.zip, an internal cache that exists to skip recomputation.
There was no battle file and no ratings file, so a run could not be visualized, nor its ELO recomputed, without rerunning the whole GPU job.
The pairwise pipeline already writes a results folder; the arena pipeline did not.

Solution

Persist the run as a small set of artifacts, built around the battle as the atomic unit: ELO is a pure function of a list of battles, so saving the battles (plus the bootstrap ratings) is enough to reconstruct or re-analyse a run.

New module judgearena/battles.py:

Battle — one outcome (model_a, model_b, winner, source, question_id?, judge_model?). from_dict ignores unknown keys, so old files keep loading after the schema grows.
write_battles / read_battles / battles_to_frame — JSONL round-trip plus the bridge to the existing compute_bradley_terry.
RatingEntry / EloReport — the leaderboard (mean + bootstrap CI per model) and run metadata.
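
To make the forward-compatibility point concrete, here is a minimal sketch of how a Battle record and its JSONL round-trip could look. It is not the code in judgearena/battles.py: the field names come from the list above, while the types, defaults, and function bodies are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Iterable, List, Optional


@dataclass
class Battle:
    # Field names follow the PR description; types and defaults are assumed.
    model_a: str
    model_b: str
    winner: str
    source: str
    question_id: Optional[str] = None
    judge_model: Optional[str] = None

    @classmethod
    def from_dict(cls, data: dict) -> "Battle":
        # Drop keys this version does not know about, so files written by a
        # newer schema (with extra fields) still load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


def write_battles(path: str, battles: Iterable[Battle]) -> None:
    # One JSON object per line (JSONL).
    with open(path, "w", encoding="utf-8") as fh:
        for battle in battles:
            fh.write(json.dumps(asdict(battle)) + "\n")


def read_battles(path: str) -> List[Battle]:
    # Blank lines are skipped; every other line must be a JSON object.
    with open(path, encoding="utf-8") as fh:
        return [Battle.from_dict(json.loads(line)) for line in fh if line.strip()]
```

Filtering on the dataclass fields means a reader built today silently ignores columns added later, at the cost of not warning about misspelled keys.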

estimate_elo_ratings.main now writes a results folder:

results/{arena}-{model}-{judge}-{ts}/
  battles.jsonl          # source of truth: every battle, tagged source = llm-judge | human
  elo_ratings.json       # leaderboard: rating + ci_low/ci_high + n_battles per model
  bootstrap_ratings.csv  # full bootstrap matrix, so CIs can be re-plotted without rerunning
  run-metadata.v1.json   # git hash, deps, timings, artifacts manifest (repro.py)

The ELO math is unchanged; the combined battle frame just gained source / question_id / judge_model columns that compute_bradley_terry ignores.
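
As an illustration of "ELO is a pure function of a list of battles", the sketch below fits Bradley-Terry strengths from battle records with a plain minorization-maximization loop and maps them to an Elo-like scale. It is not compute_bradley_terry: the function name, the 400 / 1000 scale constants, the "model_b" winner label, and counting any tie label as half a win for each side are all assumptions.

```python
import math
from collections import defaultdict


def bradley_terry_elo(battles, iters=500, scale=400.0, base=1000.0):
    """Hypothetical helper: Elo-like ratings from a list of battle dicts."""
    wins = defaultdict(float)                         # fractional wins per model
    games = defaultdict(lambda: defaultdict(float))   # games[i][j] = battles i vs j
    for b in battles:
        a, c, w = b["model_a"], b["model_b"], b["winner"]
        if a == c:
            continue  # a model cannot battle itself
        games[a][c] += 1.0
        games[c][a] += 1.0
        if w == "model_a":
            wins[a] += 1.0
        elif w == "model_b":
            wins[c] += 1.0
        else:
            # Assumption: every tie label counts as half a win for each side.
            wins[a] += 0.5
            wins[c] += 0.5

    models = sorted(games)
    if not models:
        return {}
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(n / (p[i] + p[j]) for j, n in games[i].items())
            # Tiny floor keeps a winless model's strength positive for the log.
            new_p[i] = max(wins[i], 1e-9) / denom
        # Fix the scale: normalise so the geometric mean strength is 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(new_p)
        p = {m: v / math.exp(log_mean) for m, v in new_p.items()}
    return {m: base + scale * math.log10(p[m]) for m in models}
```

Because the function depends only on the battle list, calling it on the rows of battles.jsonl (or on a resampled subset of them) needs no GPU and no judge model.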

Design notes

battles.jsonl inlines the human arena battles (not just the judged ones) so the file is self-contained and ELO is recomputable from it alone. This is why it is large (~16 MB / ~87k rows for LMArena-100k).
Two source vocabularies: Battle.source is llm-judge | human (who decided the outcome); RatingEntry.source is evaluated | human (the model under test vs. every other model on the board).

Example output

Run: Qwen2.5-1.5B-Instruct placed into LMArena-100k, judged by Qwen3-8B, 100 battles, 100 bootstraps.

Leaderboard (printed and saved):

chatgpt-4o-latest                (4697): 1114.3 ± 4.1
gpt-4o-2024-05-13                (8714): 1088.8 ± 3.2
...
phi-3-mini-4k-instruct           (538) :  890.2 ± 12.0
VLLM/Qwen/Qwen2.5-1.5B-Instruct  (100) <-----: 686.2 ± 47.9

battles.jsonl (87,331 rows = 100 llm-judge + 87,231 human):

{"model_a": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "model_b": "gpt-3.5-turbo-0125", "winner": "model_a", "source": "llm-judge", "question_id": "4c6978df...", "judge_model": "VLLM/Qwen/Qwen3-8B"}
{"model_a": "claude-3-5-sonnet-20240620", "model_b": "gpt-3.5-turbo-0125", "winner": "tie (bothbad)", "source": "human", "question_id": "1a2b3c4d...", "judge_model": null}
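
A quick way to check the llm-judge / human split of a battles file is to tally the source field line by line. The sketch below is only illustrative; count_by_source is a hypothetical helper, and the two sample records are the ones shown above.

```python
import json
from collections import Counter


def count_by_source(lines):
    """Tally JSONL battle records by their `source` field, skipping blank lines."""
    return Counter(json.loads(line)["source"] for line in lines if line.strip())


# The two sample rows from this description, used here as stand-in input.
sample_lines = [
    '{"model_a": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "model_b": "gpt-3.5-turbo-0125", '
    '"winner": "model_a", "source": "llm-judge", "question_id": "4c6978df...", '
    '"judge_model": "VLLM/Qwen/Qwen3-8B"}',
    '{"model_a": "claude-3-5-sonnet-20240620", "model_b": "gpt-3.5-turbo-0125", '
    '"winner": "tie (bothbad)", "source": "human", "question_id": "1a2b3c4d...", '
    '"judge_model": null}',
]

counts = count_by_source(sample_lines)
print(dict(counts))
```

On a real run the same function can be fed an open file handle, since iterating a text file yields one line at a time.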

elo_ratings.json:

{
  "arena": "LMArena-100k",
  "model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
  "judge_model": "VLLM/Qwen/Qwen3-8B",
  "n_bootstraps": 100,
  "seed": 0,
  "ratings": [
    {"model": "chatgpt-4o-latest", "rating": 1114.3, "ci_low": 1105.9, "ci_high": 1121.8, "n_battles": 4697, "source": "human"},
    {"model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "rating": 686.2, "ci_low": 578.8, "ci_high": 780.4, "n_battles": 100, "source": "evaluated"}
  ]
}
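
Since elo_ratings.json is plain JSON, downstream analysis needs nothing beyond the standard library. The sketch below ranks the entries and reports each confidence-interval width, using the abridged example above as input; summarize is a hypothetical helper, not part of the PR.

```python
import json

# Abridged report copied from the example above.
SAMPLE_REPORT = """
{
  "arena": "LMArena-100k",
  "model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
  "judge_model": "VLLM/Qwen/Qwen3-8B",
  "n_bootstraps": 100,
  "seed": 0,
  "ratings": [
    {"model": "chatgpt-4o-latest", "rating": 1114.3, "ci_low": 1105.9, "ci_high": 1121.8, "n_battles": 4697, "source": "human"},
    {"model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "rating": 686.2, "ci_low": 578.8, "ci_high": 780.4, "n_battles": 100, "source": "evaluated"}
  ]
}
"""


def summarize(report):
    """Rank entries by rating (highest first) and attach the CI width."""
    ranked = sorted(report["ratings"], key=lambda r: r["rating"], reverse=True)
    return [
        {
            "rank": rank,
            "model": entry["model"],
            "ci_width": entry["ci_high"] - entry["ci_low"],
            "evaluated": entry["source"] == "evaluated",
        }
        for rank, entry in enumerate(ranked, start=1)
    ]


rows = summarize(json.loads(SAMPLE_REPORT))
for row in rows:
    marker = " <-- evaluated" if row["evaluated"] else ""
    print(f'{row["rank"]:>3} {row["model"]:<36} CI width {row["ci_width"]:.1f}{marker}')
```

In this example the evaluated model's interval is far wider than that of the established models, which is consistent with it having only 100 battles.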

save better artifacts for the arena

cf263c3

kargibora requested review from ErlisLushtaku and geoalgo June 11, 2026 10:36

kargibora wants to merge 1 commit into main from feat/save-arena-battles
