Integration of Soft-ELO #42
With latest commit d53cf64,
@kargibora is this ready for review?
Yes it is!
    "model_a": opp,
    "model_b": model_name,
    "winner": winner,
    "pref": None if _is_nan_pref(pref) else 1.0 - pref,
This inverts the soft preference for every battle where our model is in position B. model_a/model_b here already encode position (A-model / B-model), and pref is P(model_b wins), the same convention fit_bradley_terry uses, so it must be stored unchanged, exactly like the is_pos_a branch on line 158.
With 1.0 - pref, pref disagrees with both winner and pref_hard, and since soft_elo is now the default, the model under evaluation gets systematically corrupted ratings (a 100%-win model ranks last). I think we should use "pref": pref (keep the None/NaN handling).
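To make the convention concrete, here is a minimal sketch (not the PR's actual code). It assumes, as described above, that `pref` is P(model_b wins) and that the row keys already encode position. The helper names `is_nan_pref` and `battle_row`, and the `"model_b"` winner label, are hypothetical stand-ins.

```python
import math


def is_nan_pref(pref):
    """Hypothetical stand-in for the PR's _is_nan_pref helper."""
    return pref is None or (isinstance(pref, float) and math.isnan(pref))


def battle_row(model_a, model_b, winner, pref):
    """Build one battle record; pref is P(model_b wins) and is stored unchanged."""
    return {
        "model_a": model_a,
        "model_b": model_b,
        "winner": winner,
        "pref": None if is_nan_pref(pref) else pref,
    }


# Our model sits in position B and the judge is certain it won.
row = battle_row("opponent", "our_model", "model_b", 1.0)

# Storing 1.0 - pref instead would claim model_a won, contradicting `winner`.
inverted = 1.0 - row["pref"]
```

Under this convention a model that wins every battle from position B keeps `pref == 1.0`, whereas the inverted value would feed a 0.0 into the fit for the same battle.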
    # If we calibrated the temperature, the prefs stored in df_judge were
    # computed with the default T=0.3. Re-parse them with the new parser so
    # the soft-ELO bootstrap uses calibrated preferences.
    if calibrated_temperature is not None:
score_parser is only used when calibration ran, so --soft-elo-temperature is dead in the default path (the cached df_judge["pref"] is always computed with T=0.3, and the cache key doesn't include the temperature). I would recommend always re-parsing from df_judge["judge_completion"] with score_parser for the soft path (dropping the if calibrated_temperature is not None gate), which also unifies the duplicated swap_mode="both" recombination logic with the block in run_judge.
    completions_A=cal_completions_a,
    completions_B=cal_completions_b,
    swap_mode=args.swap_mode,
    truncate_input_chars=args.truncate_all_input_chars,
Calibration should mirror the main judge run so that T* matches the score distribution it is applied to. I think we should use truncate_input_chars=args.truncate_judge_input_chars and pass provide_explanation=args.provide_explanation; otherwise the calibrated temperature is fit on a different prompt/truncation regime than the evaluation.
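To illustrate why the regime matters: the fitted value depends directly on the score differences the judge produces. A rough sketch of such a fit, as a one-parameter logistic MLE using `scipy.optimize.minimize_scalar`, might look like the following. It assumes the soft preference is $\sigma(\beta s)$ with $s$ the judge's score difference (B minus A, an assumption) and human labels encoded as 1 for B wins, 0 for A wins, 0.5 for a tie. `fit_beta`, `negative_log_likelihood`, and the search bounds are hypothetical, and the mapping from $\beta$ to the temperature flag is not shown.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def negative_log_likelihood(beta, score_diff, human_pref):
    """Soft cross-entropy between human labels and sigma(beta * score_diff)."""
    z = beta * score_diff
    # -log(sigma(z)) = logaddexp(0, -z); -log(1 - sigma(z)) = logaddexp(0, z)
    return np.sum(
        human_pref * np.logaddexp(0.0, -z)
        + (1.0 - human_pref) * np.logaddexp(0.0, z)
    )


def fit_beta(score_diff, human_pref, bounds=(1e-3, 50.0)):
    """Return the beta minimising the NLL; the objective is convex in beta."""
    result = minimize_scalar(
        negative_log_likelihood,
        bounds=bounds,
        args=(score_diff, human_pref),
        method="bounded",
    )
    return float(result.x)


# Synthetic check: labels generated exactly from beta = 2 should be recovered.
score_diff = np.linspace(-3.0, 3.0, 61)
human_pref = 1.0 / (1.0 + np.exp(-2.0 * score_diff))
beta_hat = fit_beta(score_diff, human_pref)
```

If the calibration pass produced score differences on a different scale than the main run (for example because of different truncation), the same human labels would yield a different fitted value.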
    _cal_n = (
        min(args.calibration_size, len(df_arena))
        if args.calibration_size is not None
        else len(df_arena)
Defaulting calibration to all arena battles can mean tens of thousands of (uncached) judge calls, which is a big API cost. We could use a default cap (e.g. a few hundred) and wrap this judge pass in cache_function_dataframe so reruns don't re-pay.
This is something we have discussed with David. We agreed that the defaults should be the parameters we believe work best, but I lean towards making this optional, as you have suggested. @geoalgo
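One way the capped default discussed in this thread could look, as a small sketch: `resolve_calibration_size` and the value 300 are hypothetical (the thread only says "a few hundred"), and caching is left out because the signature of cache_function_dataframe is not shown here.

```python
DEFAULT_CALIBRATION_CAP = 300  # hypothetical value for "a few hundred"


def resolve_calibration_size(
    calibration_size, n_battles, default_cap=DEFAULT_CALIBRATION_CAP
):
    """Number of human battles to re-judge during calibration.

    An explicit --calibration-size wins; otherwise fall back to the cap.
    Either way the result never exceeds the number of available battles.
    """
    requested = default_cap if calibration_size is None else calibration_size
    return min(requested, n_battles)
```

With this shape, users who want the full arena set can still request it explicitly, while the default run stays bounded.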
    for ann, human_winner in zip(
        cal_annotations, cal_battles["winner"].tolist(), strict=True
    ):
        sa = raw_parser.get_regexp_match(
This duplicates PairScore.parse_model_raw's extraction while being slightly different, so they can drift. We could add a PairScore.parse_raw_scores(completion) and call it from both parse_model_raw and here.
Also raw_parser = PairScore(temperature=1.0) / "using default T=1" (on this line) could be misleading as get_regexp_match doesn't use temperature.
Oh you are right, thanks for catching it.
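A sketch of the suggested refactor, with a single extraction routine shared by both callers. Everything here is illustrative: `PairScoreSketch` is not the real PairScore class, the `score_A:` / `score_B:` completion format is an assumed example, and the $\sigma((s_B - s_A)/T)$ mapping is an assumption about how the preference is derived.

```python
import math
import re
from dataclasses import dataclass

# Assumed completion format, e.g. "score_A: 3\nscore_B: 7".
_SCORE_RE = {
    "a": re.compile(r"score_A:\s*(-?\d+(?:\.\d+)?)"),
    "b": re.compile(r"score_B:\s*(-?\d+(?:\.\d+)?)"),
}


@dataclass
class PairScoreSketch:
    temperature: float = 0.3

    @staticmethod
    def parse_raw_scores(completion):
        """Return (score_a, score_b), or None if either score is missing."""
        match_a = _SCORE_RE["a"].search(completion)
        match_b = _SCORE_RE["b"].search(completion)
        if match_a is None or match_b is None:
            return None
        return float(match_a.group(1)), float(match_b.group(1))

    def parse_model_raw(self, completion):
        """Soft preference P(B wins), built on the shared extractor."""
        scores = self.parse_raw_scores(completion)
        if scores is None:
            return None
        score_a, score_b = scores
        return 1.0 / (1.0 + math.exp(-(score_b - score_a) / self.temperature))


raw = PairScoreSketch.parse_raw_scores("score_A: 3\nscore_B: 7")
pref = PairScoreSketch(temperature=1.0).parse_model_raw("score_A: 3\nscore_B: 7")
```

Because the extractor is a static method, calibration code can call it without constructing a parser with a temperature it never uses.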
I have pushed the changes, thank you @ErlisLushtaku! It will be ready to merge once we decide whether to make soft-elo the default. I think this is worth discussing again.
Implements the Soft-Elo pipeline: feed the judge's calibrated score-difference into the Bradley–Terry fit as a soft preference $\tilde y = \sigma(\beta s)$ instead of discretising to win/loss/tie. Optionally MLE-fit $\beta$ on human-labeled arena battles before the main run.
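As a rough illustration of fitting Bradley–Terry strengths to soft preferences, the sketch below uses plain gradient descent in NumPy rather than the PR's implementation. It also shows why a soft target can be handled as weighted hard labels: each battle with target $\tilde y$ contributes the same loss as a "B wins" row with weight $\tilde y$ plus an "A wins" row with weight $1 - \tilde y$. All function names, the learning rate, and the toy data are assumptions; `pref` follows the convention 0 = A wins, 1 = B wins.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def soft_cross_entropy(pref, q):
    """Cross-entropy between soft target pref = P(B wins) and prediction q."""
    return -(pref * np.log(q) + (1.0 - pref) * np.log(1.0 - q))


def weighted_hard_rows(pref):
    """Expand each soft battle into two hard-labelled rows with sample weights."""
    labels = np.concatenate([np.ones_like(pref), np.zeros_like(pref)])
    weights = np.concatenate([pref, 1.0 - pref])
    return labels, weights


def fit_soft_bt(idx_a, idx_b, pref, n_models, lr=0.5, steps=2000):
    """Gradient descent on the mean soft-CE; returns zero-centred log-strengths."""
    theta = np.zeros(n_models)
    for _ in range(steps):
        q = sigmoid(theta[idx_b] - theta[idx_a])  # predicted P(B wins)
        resid = q - pref  # derivative of the loss w.r.t. the logit
        grad = np.zeros(n_models)
        np.add.at(grad, idx_b, resid)
        np.add.at(grad, idx_a, -resid)
        theta -= lr * grad / len(pref)
        theta -= theta.mean()  # fix the additive degree of freedom
    return theta


# Toy data: three models, three battles, B favoured in each.
idx_a = np.array([0, 0, 1])
idx_b = np.array([1, 2, 2])
pref = np.array([0.7, 0.9, 0.6])
theta = fit_soft_bt(idx_a, idx_b, pref, n_models=3)

# The soft loss equals the weighted hard-label loss on the expanded rows.
q = sigmoid(theta[idx_b] - theta[idx_a])
labels, weights = weighted_hard_rows(pref)
q2 = np.concatenate([q, q])
hard_ce = -(labels * np.log(q2) + (1.0 - labels) * np.log(1.0 - q2))
soft_total = np.sum(soft_cross_entropy(pref, q))
weighted_total = np.sum(weights * hard_ce)
```

With hard labels the weights collapse to 0 or 1 (or 0.5 and 0.5 for a tie), so the expanded problem is an ordinary win/loss fit.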
What changed
- fit_bradley_terry (estimate_elo_ratings.py) replaces compute_bradley_terry. Takes a soft target pref_col ∈ [0, 1] (0 = A wins, 1 = B wins, 0.5 = tie) and uses the standard soft-CE → weighted-LR decomposition. Hard labels ({0, 0.5, 1}) reduce to the previous fit.
- […] (evaluate.calibrate_temperature): concave MLE for […] scipy.optimize.minimize_scalar on […]
- estimate_elo_ratings.main: samples human battles, reruns the judge on them, parses raw scores with PairScore(temperature=1.0), fits […] swap_mode="both" reconstruction).
- […] human_elo, mae_vs_human, method, calibrated_temperature.
New flags
- --soft-elo
- --soft-elo-temperature: default 0.3; […] [0.36, 0.60].
- --calibrate-temperature: […] --soft-elo; warns and skips otherwise.
- --calibration-size: […] --calibrate-temperature.
How to run
Hard-Elo (unchanged behavior):
How to test
uv run pytest tests/test_cli.py tests/test_estimate_elo_ratings.py