Integration of Soft-ELO#42

Open
kargibora wants to merge 16 commits into main from feat/soft-elo

Conversation

@kargibora

Collaborator

Implements the Soft-Elo pipeline: feed the judge's calibrated score-difference into the Bradley–Terry fit as a soft preference $\tilde y = \sigma(\beta s)$ instead of discretising to win/loss/tie. Optionally MLE-fit $\beta$ on human-labeled arena battles before the main run.
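To make the soft target concrete, here is a minimal sketch of $\tilde y = \sigma(\beta s)$ next to the discretised label it replaces. It assumes that $s$ is the judge's score for B minus its score for A, so that positive values favour B; the function names are illustrative and not taken from the PR.

```python
import math


def soft_preference(score_diff: float, beta: float) -> float:
    """Soft target sigmoid(beta * s), written with tanh for numerical stability."""
    return 0.5 * (1.0 + math.tanh(0.5 * beta * score_diff))


def hard_label(score_diff: float) -> float:
    """Discretised alternative: only the sign of the score difference survives."""
    if score_diff == 0:
        return 0.5
    return 1.0 if score_diff > 0 else 0.0


if __name__ == "__main__":
    # Assumed convention: s = score_B - score_A, so larger s means B is preferred.
    for s in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(s, hard_label(s), round(soft_preference(s, beta=0.3), 3))
```

A narrow margin and a wide margin get the same hard label but different soft targets, which is the information the soft fit is meant to keep.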

What changed

  • fit_bradley_terry (estimate_elo_ratings.py) replaces compute_bradley_terry. It takes a soft target pref_col ∈ [0, 1] (0 = A wins, 1 = B wins, 0.5 = tie) and uses the standard soft-CE → weighted-LR decomposition. Hard labels ({0, 0.5, 1}) reduce to the previous fit.
  • Temperature calibration (evaluate.calibrate_temperature): concave MLE for $\beta^\star$ via scipy.optimize.minimize_scalar on $\sum\log\sigma(\beta(2y-1)\Delta s)$. It is driven from estimate_elo_ratings.main, which samples human battles, reruns the judge on them, parses raw scores with PairScore(temperature=1.0), fits $\beta^\star$, then re-parses all cached judge completions with the calibrated temperature (this also handles the swap_mode="both" reconstruction).
  • Reporting: human-only BT ratings computed as ground-truth reference, prints MAE vs Human-Elo on overlapping models; return dict gains human_elo, mae_vs_human, method, calibrated_temperature.
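The soft-CE → weighted-LR decomposition mentioned above rests on the fact that a battle with soft target $\tilde y$ contributes $-[\tilde y\log p + (1-\tilde y)\log(1-p)]$ to the loss, which is the same as one "B wins" row with weight $\tilde y$ plus one "A wins" row with weight $1-\tilde y$. The sketch below illustrates this with plain gradient descent rather than the weighted logistic regression the PR uses; fit_soft_bt, its hyperparameters, and the small L2 term are assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def fit_soft_bt(battles, n_steps=2000, lr=0.5, l2=1e-3):
    """Fit Bradley-Terry strengths from (model_a, model_b, pref) rows.

    pref is P(model_b wins) in [0, 1]. Illustrative gradient-descent fit,
    not the PR's implementation.
    """
    battles = list(battles)
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    theta = {m: 0.0 for m in models}
    n = len(battles)
    for _ in range(n_steps):
        # The small L2 penalty keeps the otherwise shift-invariant problem well posed.
        grad = {m: l2 * theta[m] for m in models}
        for a, b, pref in battles:
            p_b = sigmoid(theta[b] - theta[a])
            # Gradient of the soft cross-entropy w.r.t. theta_b; theta_a gets the opposite sign.
            grad[b] += (p_b - pref) / n
            grad[a] -= (p_b - pref) / n
        for m in models:
            theta[m] -= lr * grad[m]
    return theta


if __name__ == "__main__":
    soft = fit_soft_bt([("A", "B", 0.8)] * 10)
    # Same information expressed as hard labels: 8 wins for B, 2 wins for A.
    hard = fit_soft_bt([("A", "B", 1.0)] * 8 + [("A", "B", 0.0)] * 2)
    print(soft["B"] - soft["A"], hard["B"] - hard["A"])
```

Ten battles with a soft target of 0.8 and the 8/2 split of hard labels produce the same gradient, so both fits land on the same ratings, which is the sense in which hard labels are a special case of the soft fit.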

New flags

| Flag | Default | Effect |
| --- | --- | --- |
| --soft-elo | off | Use soft BT targets instead of hard {0, 0.5, 1} labels. |
| --soft-elo-temperature | 0.3 | Initial $\beta$; overridden if calibration runs. Empirical range across judges in the paper: [0.36, 0.60]. |
| --calibrate-temperature | off | MLE-fit $\beta^\star$ on human-labeled arena battles before the run. Requires --soft-elo; warns and skips otherwise. |
| --calibration-size | all human battles | Number of human battles to sample for calibration. Needs --calibrate-temperature. |
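As a rough illustration of what --calibrate-temperature fits, the sketch below maximises $\sum\log\sigma(\beta(2y-1)\Delta s)$ over $\beta$. The PR uses scipy.optimize.minimize_scalar; this example swaps in a hand-written golden-section search so that it runs with the standard library only. The search bounds, the function names, and the convention that $\Delta s$ is B's score minus A's (with $y=1$ meaning B wins) are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def neg_log_likelihood(beta, labels, score_diffs):
    # labels y are 0 (A wins) or 1 (B wins); ties (y = 0.5) add only a constant.
    return -sum(
        log_sigmoid(beta * (2 * y - 1) * ds) for y, ds in zip(labels, score_diffs)
    )


def calibrate_beta(labels, score_diffs, lo=1e-3, hi=10.0, tol=1e-6):
    """Golden-section search for the beta that minimises the negative log-likelihood."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = neg_log_likelihood(c, labels, score_diffs)
    fd = neg_log_likelihood(d, labels, score_diffs)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = neg_log_likelihood(c, labels, score_diffs)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = neg_log_likelihood(d, labels, score_diffs)
    return (a + b) / 2


if __name__ == "__main__":
    # Judge says B is ahead by one point in every battle; humans agree 3 times out of 4.
    print(calibrate_beta(labels=[1, 1, 1, 0], score_diffs=[1.0, 1.0, 1.0, 1.0]))
```

Because the log-likelihood is concave in $\beta$, a one-dimensional bracketing search of this kind is sufficient. If the humans always agree with the judge, the optimum sits at the upper bound, so the bounds matter in practice.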

How to run

Hard-Elo (unchanged behavior):

judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200

Soft-Elo with calibration (recommended):

judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200 \
  --soft-elo --calibrate-temperature --calibration-size 300

How to test

uv run pytest tests/test_cli.py tests/test_estimate_elo_ratings.py

  • test_cli.py covers the new flags routing through the unified entrypoint;
  • test_estimate_elo_ratings.py covers fit_bradley_terry and the main pipeline.

@kargibora

Collaborator Author

With the latest commit d53cf64, --soft-elo is on by default (pass --no-soft-elo to get the hard-Elo behavior). I did not make --calibrate-temperature a default, because it requires extra computation.

@ErlisLushtaku

Collaborator

@kargibora is this ready for review?

@kargibora

Collaborator Author

> @kargibora is this ready for review?

Yes it is!

Comment thread judgearena/estimate_elo_ratings.py Outdated
"model_a": opp,
"model_b": model_name,
"winner": winner,
"pref": None if _is_nan_pref(pref) else 1.0 - pref,

Collaborator

This inverts the soft preference for every battle where our model is in position B. model_a/model_b here already encode position (A-model / B-model), and pref is P(model_b wins), the same convention fit_bradley_terry uses, so it must be stored unchanged, exactly like the is_pos_a branch on line 158.

With 1.0 - pref, pref disagrees with both winner and pref_hard, and since soft_elo is now the default, the model under evaluation gets systematically corrupted ratings (a 100%-win model ranks last). I think we should use "pref": pref (keep the None/NaN handling).
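The convention described in this comment can be sketched as follows. battle_row, implied_winner, and is_missing_pref are hypothetical helpers written for illustration, not the PR's code; the only assumption carried over from the thread is that pref means P(model_b wins).

```python
import math


def is_missing_pref(pref) -> bool:
    """Illustrative stand-in for the PR's None/NaN check."""
    return pref is None or (isinstance(pref, float) and math.isnan(pref))


def battle_row(model_name, opp, is_pos_a, pref):
    """Build one battle row; pref is P(model_b wins) and is stored unchanged."""
    model_a, model_b = (model_name, opp) if is_pos_a else (opp, model_name)
    return {
        "model_a": model_a,
        "model_b": model_b,
        "pref": None if is_missing_pref(pref) else pref,
    }


def implied_winner(row):
    """Model favoured by the soft preference, or None for ties and missing values."""
    pref = row["pref"]
    if pref is None or pref == 0.5:
        return None
    return row["model_b"] if pref > 0.5 else row["model_a"]


if __name__ == "__main__":
    # Our model sits in position B and the judge strongly prefers B.
    row = battle_row("ours", "opp", is_pos_a=False, pref=0.9)
    print(implied_winner(row))
    # Flipping pref while the columns already encode position credits the opponent.
    flipped = dict(row, pref=1.0 - row["pref"])
    print(implied_winner(flipped))
```

Since model_a and model_b already carry the position, flipping pref as well applies the swap twice, which is how a model that wins every battle from position B ends up credited with losses.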

Collaborator Author

734c08b fixes this, thanks!

Comment thread judgearena/estimate_elo_ratings.py Outdated
# If we calibrated the temperature, the prefs stored in df_judge were
# computed with the default T=0.3. Re-parse them with the new parser so
# the soft-ELO bootstrap uses calibrated preferences.
if calibrated_temperature is not None:

Collaborator

score_parser is only used when calibration ran, so --soft-elo-temperature is dead in the default path (cached df_judge["pref"] is always T=0.3, and the cache key doesn't include temperature). I'd recommend always re-parsing from df_judge["judge_completion"] with score_parser on the soft path (drop the if calibrated_temperature is not None gate), which would also unify the duplicated swap_mode="both" recombination logic with the block in run_judge.

Collaborator Author

Closed with 7411407

Comment thread judgearena/estimate_elo_ratings.py Outdated
completions_A=cal_completions_a,
completions_B=cal_completions_b,
swap_mode=args.swap_mode,
truncate_input_chars=args.truncate_all_input_chars,

Collaborator

Calibration should mirror the main judge run so that T* matches the score distribution it is applied to. I think we should use truncate_input_chars=args.truncate_judge_input_chars and pass provide_explanation=args.provide_explanation. Otherwise the calibrated temperature is fit on a different prompt/truncation regime than the evaluation.

_cal_n = (
min(args.calibration_size, len(df_arena))
if args.calibration_size is not None
else len(df_arena)

Collaborator

Defaulting calibration to all arena battles can mean tens of thousands of (uncached) judge calls, which is a big API cost. We could use a default cap (e.g. a few hundred) and wrap this judge pass in cache_function_dataframe so reruns don't re-pay.
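A default cap along these lines could look like the sketch below. DEFAULT_CALIBRATION_CAP and resolve_calibration_size are hypothetical names, and 300 is only borrowed from the --calibration-size value in the PR description's example command. The caching half of the suggestion is not shown, because the signature of cache_function_dataframe does not appear in this thread.

```python
DEFAULT_CALIBRATION_CAP = 300  # hypothetical default, not a value agreed in the PR


def resolve_calibration_size(calibration_size, n_battles, default_cap=DEFAULT_CALIBRATION_CAP):
    """Number of human battles to judge for calibration.

    An explicit --calibration-size is honoured; when it is None the cap applies
    instead of falling back to every available battle.
    """
    requested = default_cap if calibration_size is None else calibration_size
    return min(requested, n_battles)


if __name__ == "__main__":
    print(resolve_calibration_size(None, n_battles=50_000))
    print(resolve_calibration_size(1_000, n_battles=50_000))
    print(resolve_calibration_size(None, n_battles=120))
```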

Collaborator Author

This is something we discussed with David. We agreed that the default parameters should be the ones that work best in our view, but I lean towards making this optional, as you suggested. @geoalgo

Comment thread judgearena/estimate_elo_ratings.py Outdated
for ann, human_winner in zip(
cal_annotations, cal_battles["winner"].tolist(), strict=True
):
sa = raw_parser.get_regexp_match(

Collaborator

This duplicates PairScore.parse_model_raw's extraction while being slightly different, so they can drift. We could add a PairScore.parse_raw_scores(completion) and call it from both parse_model_raw and here.

Also raw_parser = PairScore(temperature=1.0) / "using default T=1" (on this line) could be misleading as get_regexp_match doesn't use temperature.
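One possible shape for the suggested refactor is sketched below. PairScoreSketch is a hypothetical stand-in, not the real PairScore: the "score_A: ... score_B: ..." completion format, the regular expression, and the use of temperature as the multiplier $\beta$ in $\sigma(\beta\,\Delta s)$ are all assumptions made for illustration.

```python
import math
import re


class PairScoreSketch:
    """Hypothetical stand-in for PairScore with one shared extraction path."""

    # Assumed completion format: "score_A: <number> ... score_B: <number>".
    _PATTERN = re.compile(
        r"score_A:\s*(-?\d+(?:\.\d+)?).*?score_B:\s*(-?\d+(?:\.\d+)?)",
        re.IGNORECASE | re.DOTALL,
    )

    def __init__(self, temperature: float = 0.3):
        self.temperature = temperature

    @staticmethod
    def parse_raw_scores(completion: str):
        """Return (score_a, score_b), or None if no scores are found. Temperature-free."""
        match = PairScoreSketch._PATTERN.search(completion)
        if match is None:
            return None
        return float(match.group(1)), float(match.group(2))

    def parse_model_raw(self, completion: str):
        """Soft preference P(B wins), built on the shared extraction above."""
        scores = self.parse_raw_scores(completion)
        if scores is None:
            return None
        score_a, score_b = scores
        x = self.temperature * (score_b - score_a)
        return 0.5 * (1.0 + math.tanh(0.5 * x))  # stable sigmoid


if __name__ == "__main__":
    completion = "score_A: 3\nscore_B: 7"
    # The calibration loop only needs raw scores, so no parser instance is required.
    print(PairScoreSketch.parse_raw_scores(completion))
    print(PairScoreSketch(temperature=0.3).parse_model_raw(completion))
```

Making the extraction a static method also addresses the second point in the comment: the calibration code no longer has to construct a parser with a temperature that is never used.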

Collaborator Author

Oh, you are right, thanks for catching it.

@kargibora

Collaborator Author

I have pushed the changes, thank you @ErlisLushtaku! This will be ready to merge once we decide whether soft-elo should be the default. I think that question is worth discussing again.
