Integration of Soft-ELO by kargibora · Pull Request #42 · OpenEuroLLM/JudgeArena

kargibora · 2026-05-12T13:10:13Z

Implements the Soft-Elo pipeline: feed the judge's calibrated score-difference into the Bradley–Terry fit as a soft preference $\tilde y = \sigma(\beta s)$ instead of discretising to win/loss/tie. Optionally MLE-fit $\beta$ on human-labeled arena battles before the main run.

What changed

fit_bradley_terry (estimate_elo_ratings.py) replaces compute_bradley_terry. It takes a soft target pref_col ∈ [0, 1] (0 = A wins, 1 = B wins, 0.5 = tie) and uses the standard decomposition of soft cross-entropy into a weighted logistic regression. Hard labels ({0, 0.5, 1}) reduce to the previous fit.
Temperature calibration (evaluate.calibrate_temperature): concave MLE for $\beta^\star$ via scipy.optimize.minimize_scalar on $\sum\log\sigma(\beta(2y-1)\Delta s)$. Driven from estimate_elo_ratings.main — samples human battles, reruns the judge on them, parses raw scores with PairScore(temperature=1.0), fits $\beta^\star$, then re-parses all cached judge completions with the calibrated temperature (handles swap_mode="both" reconstruction).
Reporting: human-only BT ratings computed as ground-truth reference, prints MAE vs Human-Elo on overlapping models; return dict gains human_elo, mae_vs_human, method, calibrated_temperature.
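
The soft-CE → weighted-LR decomposition mentioned above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `expand_soft_targets` and `fit_bt_soft` are hypothetical, and plain gradient descent with a small L2 penalty stands in for whatever solver `fit_bradley_terry` actually uses. Each battle with soft target `pref` becomes two hard-label rows weighted `pref` and `1 - pref`.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expand_soft_targets(battles):
    """Turn each (model_a, model_b, pref) into two weighted hard-label rows.

    pref is P(model_b wins). The soft cross-entropy
    -[pref * log p + (1 - pref) * log(1 - p)] is exactly the sum of a
    "B wins" loss with weight pref and an "A wins" loss with weight 1 - pref.
    """
    rows = []
    for a, b, pref in battles:
        rows.append((a, b, 1, pref))        # label 1: B wins
        rows.append((a, b, 0, 1.0 - pref))  # label 0: A wins
    return rows


def fit_bt_soft(battles, lr=0.5, steps=2000, l2=1e-3):
    """Fit Bradley-Terry log-strengths by gradient descent (illustrative only)."""
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    theta = {m: 0.0 for m in models}
    rows = expand_soft_targets(battles)
    total_w = sum(row[3] for row in rows)  # equals the number of battles
    for _ in range(steps):
        grad = {m: l2 * theta[m] for m in models}
        for a, b, y, w in rows:
            p_b = sigmoid(theta[b] - theta[a])
            g = w * (p_b - y) / total_w  # gradient w.r.t. (theta_b - theta_a)
            grad[b] += g
            grad[a] -= g
        for m in models:
            theta[m] -= lr * grad[m]
    return theta
```

With hard labels one of the two rows has weight zero, so the expansion collapses to the ordinary hard-label fit, which is the reduction claimed above.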

New flags

| Flag | Default | Effect |
| --- | --- | --- |
| `--soft-elo` | off | Use soft BT targets instead of hard {0, 0.5, 1} labels. |
| `--soft-elo-temperature` | `0.3` | Initial $\beta$; overridden if calibration runs. Empirical range across judges in the paper: `[0.36, 0.60]`. |
| `--calibrate-temperature` | off | MLE-fit $\beta^\star$ on human-labeled arena battles before the run. Requires `--soft-elo`; warns and skips otherwise. |
| `--calibration-size` | all human battles | Number of human battles to sample for calibration. Needs `--calibrate-temperature`. |
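
For intuition, the fit behind `--calibrate-temperature` can be sketched in a few lines. This is an illustration under stated assumptions, not the PR's `calibrate_temperature`: the PR uses scipy.optimize.minimize_scalar, whereas the sketch swaps in a plain golden-section search to stay dependency-free. The function names, the search bounds, and the sign convention (score of B minus score of A, label 1 = B wins) are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    # log(sigma(x)), computed stably for large |x|.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_likelihood(beta, labels, score_diffs):
    # labels: 1 = B wins, 0 = A wins, 0.5 = tie (a tie adds only a constant).
    # score_diffs: judge score of B minus judge score of A (assumed convention).
    return sum(
        log_sigmoid(beta * (2 * y - 1) * ds)
        for y, ds in zip(labels, score_diffs)
    )


def calibrate_beta(labels, score_diffs, lo=1e-3, hi=10.0, tol=1e-6):
    """Golden-section search for the maximiser of the concave log-likelihood."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if log_likelihood(c, labels, score_diffs) > log_likelihood(d, labels, score_diffs):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2
```

If the judge agrees with every human label, the likelihood keeps increasing with $\beta$ and the search returns a value at the upper bound, so the bounds matter in that degenerate case.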

How to run

Hard-Elo (unchanged behavior):

judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200

Soft-Elo with calibration (recommended):

judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200 \
  --soft-elo --calibrate-temperature --calibration-size 300

How to test

uv run pytest tests/test_cli.py tests/test_estimate_elo_ratings.py

test_cli.py covers the new flags routing through the unified entrypoint;
test_estimate_elo_ratings.py covers fit_bradley_terry and the main pipeline.

kargibora · 2026-05-27T11:55:48Z

As of the latest commit d53cf64, --soft-elo is on by default (pass --no-soft-elo to use hard Elo). However, I did not make --calibrate-temperature a default, because it requires extra computation.

ErlisLushtaku · 2026-05-27T20:38:05Z

@kargibora is this ready for review?

kargibora · 2026-05-28T07:28:22Z

> @kargibora is this ready for review?

Yes it is!

ErlisLushtaku · 2026-06-02T13:31:42Z

+                "model_a": opp,
+                "model_b": model_name,
+                "winner": winner,
+                "pref": None if _is_nan_pref(pref) else 1.0 - pref,


This inverts the soft preference for every battle where our model is in position B. model_a/model_b here already encode position (A-model / B-model), and pref is P(model_b wins), the same convention fit_bradley_terry uses, so it must be stored unchanged, exactly like the is_pos_a branch on line 158.

With 1.0 - pref, pref disagrees with both winner and pref_hard, and since soft_elo is now the default, the model under evaluation gets systematically corrupted ratings (a 100%-win model ranks last). I think we should use "pref": pref (keep the None/NaN handling).

734c08b fixes this, thanks!
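
To make the convention concrete, here is a small sketch of the row construction after the fix. The helper `battle_row` and the literal winner value `"model_b"` are hypothetical, for illustration only; the point is that `pref` is stored unchanged in both position branches because `model_a` / `model_b` already encode position.

```python
import math


def battle_row(model_name, opp, is_pos_a, winner, pref):
    """Build one battle record; pref is always P(model_b wins)."""
    # Position is encoded by which slot the evaluated model occupies.
    model_a, model_b = (model_name, opp) if is_pos_a else (opp, model_name)
    missing = pref is None or (isinstance(pref, float) and math.isnan(pref))
    return {
        "model_a": model_a,
        "model_b": model_b,
        "winner": winner,
        # No 1 - pref flip: the value already refers to the model in slot B.
        "pref": None if missing else pref,
    }


# Our model sits in position B and the judge strongly prefers B.
row = battle_row("ours", "opponent", is_pos_a=False, winner="model_b", pref=0.9)
print(row["model_b"], row["pref"])
```

With the earlier `1.0 - pref` this row would have carried a preference of about 0.1 for the model that the hard label marks as the winner.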

ErlisLushtaku · 2026-06-02T13:39:20Z

+    # If we calibrated the temperature, the prefs stored in df_judge were
+    # computed with the default T=0.3.  Re-parse them with the new parser so
+    # the soft-ELO bootstrap uses calibrated preferences.
+    if calibrated_temperature is not None:


score_parser is only used when calibration ran, so --soft-elo-temperature is dead in the default path (cached df_judge["pref"] is always T=0.3, and the cache key doesn't include temperature). I would recommend always re-parsing from df_judge["judge_completion"] with score_parser for the soft path (dropping the if calibrated_temperature is not None gate), which also unifies the duplicated swap_mode="both" recombination logic with the block in run_judge.

Closed with 7411407
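
A rough sketch of the "always re-parse on the soft path" idea, under stated assumptions: the helper names are hypothetical, raw scores are taken to be already extracted as (score_a, score_b) pairs, and the swap_mode="both" recombination shown here (flip the swapped run back, then average) is an assumed scheme rather than the logic in run_judge.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reparse_prefs(raw_scores, beta, soft_elo=True):
    """Recompute soft preferences from raw scores with the current beta.

    raw_scores: list of (score_a, score_b), or None where parsing failed.
    Runs whenever the soft path is active, whether or not calibration ran,
    so a user-supplied beta is never silently ignored.
    """
    if not soft_elo:
        return None  # the hard path keeps its discrete labels
    return [
        None if s is None else sigmoid(beta * (s[1] - s[0]))
        for s in raw_scores
    ]


def combine_swapped(pref_original, pref_swapped):
    """Merge the two orderings of one battle (assumed averaging scheme)."""
    # In the swapped run the models trade slots, so flip it back first.
    values = [pref_original, None if pref_swapped is None else 1.0 - pref_swapped]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

Because the preference depends on `beta`, caching only the raw completions (and not the derived `pref`) avoids the stale T=0.3 values described above.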

ErlisLushtaku · 2026-06-02T13:46:38Z

+                completions_A=cal_completions_a,
+                completions_B=cal_completions_b,
+                swap_mode=args.swap_mode,
+                truncate_input_chars=args.truncate_all_input_chars,


Calibration should mirror the main judge run so T* matches the score distribution it's applied to, so I think we should use truncate_input_chars=args.truncate_judge_input_chars and pass provide_explanation=args.provide_explanation. Otherwise the calibrated temperature is fit on a different prompt/truncation regime than the evaluation.

ErlisLushtaku · 2026-06-02T13:50:04Z

+            _cal_n = (
+                min(args.calibration_size, len(df_arena))
+                if args.calibration_size is not None
+                else len(df_arena)


Defaulting calibration to all arena battles can mean tens of thousands of (uncached) judge calls, which is a big API cost. We could use a default cap (e.g. a few hundred) and wrap this judge pass in cache_function_dataframe so reruns don't re-pay.

This is something we have discussed with David. We agreed that the defaults should be the parameters that work best in our view, but I lean towards making this optional, as you have suggested. @geoalgo
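
For reference, the capped default suggested above could look roughly like this. The helper name and the cap of 300 are hypothetical (300 simply mirrors the --calibration-size used in the example command); the caching part is left out because it depends on the signature of cache_function_dataframe.

```python
DEFAULT_CALIBRATION_CAP = 300  # hypothetical "a few hundred" default


def resolve_calibration_size(requested, n_available, default_cap=DEFAULT_CALIBRATION_CAP):
    """Number of human battles to sample for calibration.

    requested: value of --calibration-size, or None if the flag was not given.
    n_available: number of human-labeled battles in the arena data.
    """
    # An explicit request is honoured; otherwise fall back to the cap.
    limit = default_cap if requested is None else requested
    # Never ask for more battles than exist.
    return min(limit, n_available)
```

An explicit `--calibration-size` larger than the cap still works, so calibrating on all battles remains possible but has to be asked for.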

ErlisLushtaku · 2026-06-02T14:03:34Z

+            for ann, human_winner in zip(
+                cal_annotations, cal_battles["winner"].tolist(), strict=True
+            ):
+                sa = raw_parser.get_regexp_match(


This duplicates PairScore.parse_model_raw's extraction while being slightly different, so they can drift. We could add a PairScore.parse_raw_scores(completion) and call it from both parse_model_raw and here.

Also raw_parser = PairScore(temperature=1.0) / "using default T=1" (on this line) could be misleading as get_regexp_match doesn't use temperature.

Oh, you are right, thanks for catching it.

kargibora · 2026-06-08T09:13:53Z

I have pushed the changes, thank you @ErlisLushtaku! This will be ready to merge once we decide whether soft-elo should be the default. I think this is worth discussing again.

kargibora added 11 commits April 14, 2026 15:33

Add soft elo

af4bced

Add temperature calibration

898b1e4

Update READMe for soft-elo support

e4498b6

Update temperature

6b401e8

Merge branch 'main' into feat/soft-elo

6f960af

Update CLI to unify elo computation

995db21

Remove duplication

b357116

Fix a edge case when all the labels are same

be53e8c

ruff fix

61f1f84

Merge branch 'main' into feat/soft-elo

22aa56e

Make soft-elo default

d53cf64

ErlisLushtaku reviewed Jun 2, 2026


kargibora added 5 commits June 8, 2026 10:39

fix preference bug

734c08b

fix soft-elo dead flag bug

7411407

fix: calibration error

8b72171

bug: fix the problem in the regex parser

9855baf

Merge branch 'main' into feat/soft-elo

8247957
