2. Add prompt presets #54
Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split. Includes-AI-Code: true
Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top. Includes-AI-Code: true
…pt-presets-localized
@kargibora this is the same as #48; I just split some of the MT-Bench logic out into a separate PR, #55. Feel free to give it another review, but it should be the same as what you saw before. #55 is new, though.
kargibora left a comment
Thanks for the clean-up! I left some more comments after reading it thoroughly.
provide_explanation=provide_explanation,
)

if preset is None:
When preset is None and provide_explanation is set, we fall back to DEFAULT_WITH_EXPLANATION_PRESET, so task and provide_explanation cannot take effect at the same time. Is this intended?
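If the task default is meant to win over the explanation fallback, the precedence could be made explicit along these lines. This is only a minimal sketch, not the PR's code: `resolve_preset_name` and the constant values are placeholders made up for illustration, and the real resolver may need a different order.

```python
from typing import Optional

# Placeholder values; the real constants are defined in the PR.
DEFAULT_JUDGE_PROMPT_PRESET = "default"
DEFAULT_WITH_EXPLANATION_PRESET = "default_with_explanation"
TASK_DEFAULT_PRESET = {"mt-bench": "mt_bench"}  # hypothetical entry


def resolve_preset_name(
    preset: Optional[str],
    task: Optional[str],
    provide_explanation: bool,
) -> str:
    """Explicit preset first, then the task default, then the global fallbacks."""
    if preset is not None:
        return preset
    if task is not None and task in TASK_DEFAULT_PRESET:
        # The task default is checked before the explanation fallback,
        # so passing provide_explanation no longer hides it.
        return TASK_DEFAULT_PRESET[task]
    if provide_explanation:
        return DEFAULT_WITH_EXPLANATION_PRESET
    return DEFAULT_JUDGE_PROMPT_PRESET
```

With this order, `task` and `provide_explanation` can be passed together and the outcome is predictable. If the opposite precedence is the intended one, a short comment at the `if preset is None:` branch would be enough.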
return TASK_DEFAULT_PRESET[task]
if task.startswith("m-arena-hard"):
return DEFAULT_JUDGE_PROMPT_PRESET
if task.startswith("fluency"):
I don't think matching tasks against these hard-coded keywords is ideal. For example, a task name such as `-fluency` would not match the prefix check and would fail. We could use asserts and a proper task-resolving structure here instead.
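One possible shape for such a structure is a small registry that fails loudly on unknown names. This is a rough sketch under assumptions: `TASK_FAMILY_PRESET`, `FLUENCY_PRESET`, `resolve_task_preset`, and the preset values are hypothetical, and it assumes task names follow a `<family>-<variant>` pattern.

```python
# Placeholder values; the real preset names live in the PR.
DEFAULT_JUDGE_PROMPT_PRESET = "default"
FLUENCY_PRESET = "fluency"  # hypothetical name

# Hypothetical registry: task family -> default preset.
TASK_FAMILY_PRESET = {
    "m-arena-hard": DEFAULT_JUDGE_PROMPT_PRESET,
    "fluency": FLUENCY_PRESET,
}


def resolve_task_preset(task: str) -> str:
    """Return the preset for a task, or raise if the task family is unknown."""
    for family, preset in TASK_FAMILY_PRESET.items():
        # Match the bare family name or a "<family>-<variant>" task name.
        if task == family or task.startswith(family + "-"):
            return preset
    known = ", ".join(sorted(TASK_FAMILY_PRESET))
    raise ValueError(f"Unknown task {task!r}; expected one of: {known}")
```

A malformed name then raises a `ValueError` that lists the known families, rather than silently falling through the chain of `startswith` checks.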
)
_append_results(swapped_judgments, swapped_prompt_kwargs, swapped=True)

return pd.Series(preferences, dtype=float), annotations, metadata, 0
This is a bit confusing: the return type is annotated as None, but we return a tuple with 4 elements?
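One way to make the annotation match the return statement is a named tuple. The sketch below is illustrative only: `PresetJudgeResult` is a made-up name, and the element types of `annotations` and `metadata` are guesses, since the diff does not show them.

```python
from typing import TYPE_CHECKING, Any, NamedTuple

if TYPE_CHECKING:  # only needed by type checkers, so the sketch runs without pandas
    import pandas as pd


class PresetJudgeResult(NamedTuple):
    """Hypothetical named return type for judge_mt_bench_with_preset."""

    preferences: "pd.Series"
    annotations: list[Any]  # element type is a guess
    metadata: list[Any]  # element type is a guess
    num_inconsistent: int
```

Call sites that unpack four values keep working, because a `NamedTuple` is still a tuple, and new code can read `result.preferences` by name.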
judge_chat_model,
resolved_prompt: ResolvedJudgePrompt,
) -> pd.Series:
prefs, annotations, combined_metadata, _num_inconsistent = (
judge_mt_bench_with_preset always returns 0 for this value, and even if it did not, the _num_inconsistent variable is never used. We should either drop it or use it properly.
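If we go the "use it properly" route, the count could be derived from the original and swapped passes instead of being hard-coded. A minimal sketch under loud assumptions: `count_inconsistent` is a hypothetical helper, both lists are assumed to be in the same orientation already (the swapped pass flipped back), and NaN is assumed to mark a judgment that could not be parsed. None of this is confirmed by the diff.

```python
import math


def count_inconsistent(original: list[float], swapped: list[float]) -> int:
    """Count pairs whose verdict changes when the answer order is swapped.

    Assumes both lists use the same orientation and that NaN marks an
    unparsed judgment (both are assumptions, not taken from the PR).
    """
    if len(original) != len(swapped):
        raise ValueError("original and swapped must have the same length")
    count = 0
    for a, b in zip(original, swapped):
        if math.isnan(a) or math.isnan(b):
            continue  # unparsed judgments are skipped, not counted
        if a != b:
            count += 1
    return count
```

If the count is not needed anywhere, dropping the fourth return value is the simpler fix.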
Summary
Stack
pr32-split-v2/02.5-mt-bench-preset-judging.