Save arena battles and ELO ratings as artifacts#63

Open: kargibora wants to merge 1 commit into main from feat/save-arena-battles

@kargibora (Collaborator)

Problem

The arena ELO pipeline (estimate_elo_ratings.main) computed Bradley-Terry ratings but saved nothing usable.
The ratings were only printed and returned (and the CLI discards the return value), while the per-battle data lived only in cache/elo/*.csv.zip, an internal cache that exists to skip recomputation.
There was no battle file and no ratings file, so a run could not be visualized or its ELO recomputed without rerunning the whole GPU job.
The pairwise pipeline already writes a results folder; the arena pipeline did not.

Solution

Persist the run as a small set of artifacts, built around the battle as the atomic unit: ELO is a pure function of a list of battles, so saving the battles (plus the bootstrap ratings) is enough to reconstruct or re-analyse a run.
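
To illustrate why a list of battles is sufficient, here is a minimal sketch that fits Bradley-Terry strengths from battle dicts alone. It is not the PR's compute_bradley_terry: the function name, the plain MM iteration, the choice to split every non-win label as half a win per side, and the 400 / 1000 Elo-style scaling are all assumptions made for the example.

```python
import math
from collections import defaultdict


def bradley_terry_ratings(battles, n_iter=200, scale=400.0, base=1000.0):
    """Illustrative MM fit of Bradley-Terry strengths (hypothetical helper)."""
    wins = defaultdict(float)  # (winner, loser) -> fractional win count
    models = set()
    for b in battles:
        a, c = b["model_a"], b["model_b"]
        models.update((a, c))
        if b["winner"] == "model_a":
            wins[(a, c)] += 1.0
        elif b["winner"] == "model_b":
            wins[(c, a)] += 1.0
        else:
            # Assumption: any other label is a tie, counted as half a win each.
            wins[(a, c)] += 0.5
            wins[(c, a)] += 0.5

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        new = {}
        for i in models:
            w_i = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            # Clamp so a model with zero wins does not collapse to log(0).
            new[i] = max(w_i / denom, 1e-12) if denom > 0 else strength[i]
        # Normalise so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(s) for s in new.values()) / len(new)
        strength = {m: s / math.exp(log_mean) for m, s in new.items()}

    # Map strengths onto an Elo-like scale (assumed convention).
    return {m: base + scale * math.log10(s) for m, s in strength.items()}
```

Because the function reads nothing but its argument, calling it again on the same battles gives the same ratings, which is the property the artifact layout relies on.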

New module judgearena/battles.py:

  • Battle — one outcome (model_a, model_b, winner, source, question_id?, judge_model?). from_dict ignores unknown keys, so old files keep loading after the schema grows.
  • write_battles / read_battles / battles_to_frame — JSONL round-trip plus the bridge to the existing compute_bradley_terry.
  • RatingEntry / EloReport — the leaderboard (mean + bootstrap CI per model) and run metadata.
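
A rough sketch of how such a record and its JSONL round-trip could look is shown below. The field names come from the description above; the dataclass layout, the function signatures, and the frozen setting are assumptions for illustration and may differ from judgearena/battles.py.

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Battle:
    model_a: str
    model_b: str
    winner: str
    source: str
    question_id: Optional[str] = None
    judge_model: Optional[str] = None

    @classmethod
    def from_dict(cls, data):
        # Drop keys this version does not know, so files written by a
        # newer schema still load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


def write_battles(path, battles):
    # One JSON object per line (JSONL).
    with Path(path).open("w", encoding="utf-8") as fh:
        for battle in battles:
            fh.write(json.dumps(asdict(battle)) + "\n")


def read_battles(path):
    # Skip blank lines so a trailing newline does not break loading.
    with Path(path).open(encoding="utf-8") as fh:
        return [Battle.from_dict(json.loads(line)) for line in fh if line.strip()]
```

Filtering on the dataclass fields in from_dict is what keeps loading forward-compatible: a record with an extra key is read as if the key were absent.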

estimate_elo_ratings.main now writes a results folder:

results/{arena}-{model}-{judge}-{ts}/
  battles.jsonl          # source of truth: every battle, tagged source = llm-judge | human
  elo_ratings.json       # leaderboard: rating + ci_low/ci_high + n_battles per model
  bootstrap_ratings.csv  # full bootstrap matrix, so CIs can be re-plotted without rerunning
  run-metadata.v1.json   # git hash, deps, timings, artifacts manifest (repro.py)

The ELO math is unchanged; the combined battle frame just gained source / question_id / judge_model columns that compute_bradley_terry ignores.
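
As an example of re-deriving intervals without rerunning anything, the sketch below summarises a bootstrap matrix into a mean and a percentile interval per model. The CSV layout (a header row of model names, one row per bootstrap, no index column), the 2.5 / 97.5 percentile levels, and the helper names are assumptions; check them against the actual bootstrap_ratings.csv before relying on this.

```python
import csv


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize_bootstraps(fh, low=2.5, high=97.5):
    """Return {model: {rating, ci_low, ci_high}} from an open CSV file."""
    reader = csv.DictReader(fh)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))

    summary = {}
    for name, vals in columns.items():
        vals.sort()
        summary[name] = {
            "rating": sum(vals) / len(vals),
            "ci_low": percentile(vals, low),
            "ci_high": percentile(vals, high),
        }
    return summary
```

It would be called as `summarize_bootstraps(open(path, newline=""))`, with `path` pointing at the CSV inside a results folder.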

Design notes

  • battles.jsonl inlines the human arena battles (not just the judged ones) so the file is self-contained and ELO is recomputable from it alone. This is why it is large (~16 MB / ~87k rows for LMArena-100k).
  • Two source vocabularies: Battle.source is llm-judge | human (who decided the outcome); RatingEntry.source is evaluated | human (the model under test vs. every other model on the board).
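
The two vocabularies can be kept apart with separate type aliases, as in the sketch below. The alias and function names are hypothetical, and deriving the battle source from whether judge_model is set is an assumption based on the example rows, not something the PR states.

```python
from typing import Literal, Optional

BattleSource = Literal["llm-judge", "human"]   # who decided the outcome
RatingSource = Literal["evaluated", "human"]   # role of the model on the board


def battle_source(judge_model: Optional[str]) -> BattleSource:
    # Assumption: human arena battles carry no judge model.
    return "llm-judge" if judge_model is not None else "human"


def rating_source(model: str, evaluated_model: str) -> RatingSource:
    # The model under test is "evaluated"; every other model is "human".
    return "evaluated" if model == evaluated_model else "human"
```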

Example output

Run: Qwen2.5-1.5B-Instruct placed into LMArena-100k, judged by Qwen3-8B, 100 battles, 100 bootstraps.

Leaderboard (printed and saved):

chatgpt-4o-latest                (4697): 1114.3 ± 4.1
gpt-4o-2024-05-13                (8714): 1088.8 ± 3.2
...
phi-3-mini-4k-instruct           (538) :  890.2 ± 12.0
VLLM/Qwen/Qwen2.5-1.5B-Instruct  (100) <-----: 686.2 ± 47.9

battles.jsonl (87,331 rows = 100 llm-judge + 87,231 human):

{"model_a": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "model_b": "gpt-3.5-turbo-0125", "winner": "model_a", "source": "llm-judge", "question_id": "4c6978df...", "judge_model": "VLLM/Qwen/Qwen3-8B"}
{"model_a": "claude-3-5-sonnet-20240620", "model_b": "gpt-3.5-turbo-0125", "winner": "tie (bothbad)", "source": "human", "question_id": "1a2b3c4d...", "judge_model": null}

elo_ratings.json:

{
  "arena": "LMArena-100k",
  "model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
  "judge_model": "VLLM/Qwen/Qwen3-8B",
  "n_bootstraps": 100,
  "seed": 0,
  "ratings": [
    {"model": "chatgpt-4o-latest", "rating": 1114.3, "ci_low": 1105.9, "ci_high": 1121.8, "n_battles": 4697, "source": "human"},
    {"model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "rating": 686.2, "ci_low": 578.8, "ci_high": 780.4, "n_battles": 100, "source": "evaluated"}
  ]
}
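
For downstream use, a file with this shape can be turned back into a leaderboard with a few lines of Python. The sketch below inlines the example above instead of reading from disk; the helper name and the output format are illustrative only.

```python
import json

report = json.loads("""
{
  "arena": "LMArena-100k",
  "model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
  "judge_model": "VLLM/Qwen/Qwen3-8B",
  "n_bootstraps": 100,
  "seed": 0,
  "ratings": [
    {"model": "chatgpt-4o-latest", "rating": 1114.3, "ci_low": 1105.9,
     "ci_high": 1121.8, "n_battles": 4697, "source": "human"},
    {"model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "rating": 686.2, "ci_low": 578.8,
     "ci_high": 780.4, "n_battles": 100, "source": "evaluated"}
  ]
}
""")


def leaderboard_lines(report):
    """Format ratings from highest to lowest, marking the evaluated model."""
    rows = sorted(report["ratings"], key=lambda r: r["rating"], reverse=True)
    lines = []
    for r in rows:
        marker = " <-----" if r["source"] == "evaluated" else ""
        lines.append(
            f"{r['model']} ({r['n_battles']}){marker}: "
            f"{r['rating']:.1f} [{r['ci_low']:.1f}, {r['ci_high']:.1f}]"
        )
    return lines


for line in leaderboard_lines(report):
    print(line)
```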
