2. Add prompt presets #54

Open
ErlisLushtaku wants to merge 10 commits into main from pr32-split-v2/02-prompt-presets-only
@ErlisLushtaku ErlisLushtaku commented May 28, 2026

Collaborator

Summary

  • Move shared judge prompt preset resolution into its own stack layer.
  • Wire prompt preset selection through CLI, pairwise judging, and MT-Bench preset judging.

Stack

  • Next: pr32-split-v2/02.5-mt-bench-preset-judging.

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true
Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true
@ErlisLushtaku

Collaborator Author

@kargibora this is the same as #48; I just split the MT-Bench-related logic into a separate PR, #55. Feel free to give this one another review, but it should be the same as what you saw before. #55 is new, though.

@ErlisLushtaku ErlisLushtaku requested a review from kargibora May 28, 2026 09:04

@kargibora kargibora left a comment

Collaborator

Thanks for the clean-up! I left some more comments after reading it thoroughly.

provide_explanation=provide_explanation,
)

if preset is None:


If preset is None, we set it to DEFAULT_WITH_EXPLANATION_PRESET whenever provide_explanation is set. So we cannot use task and provide_explanation at the same time. Is this intended?

    return TASK_DEFAULT_PRESET[task]
if task.startswith("m-arena-hard"):
    return DEFAULT_JUDGE_PROMPT_PRESET
if task.startswith("fluency"):


I think checking tasks for these specific keyword prefixes is not optimal. If we use a task name like `-fluency`, for example, it will fail. We could use asserts and a task-resolving structure for this one.
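To make the suggestion concrete, here is a rough sketch of what a task-resolving structure could look like. It is only an illustration: the preset values, FLUENCY_PRESET, TASK_FAMILY_PRESET, and resolve_task_preset are hypothetical names and not part of this PR, and raising ValueError instead of asserting is my own assumption. Families are matched as whole hyphen-delimited components, so both a prefix and a suffix form resolve, and unknown tasks fail loudly.

```python
# Placeholder values: the real constants live in the PR and may differ.
DEFAULT_JUDGE_PROMPT_PRESET = "default"
FLUENCY_PRESET = "fluency"  # hypothetical name, not taken from the PR

# Exact task names are checked first (hypothetical entry for illustration).
TASK_DEFAULT_PRESET = {"mt-bench": "mt-bench"}

# Task families, matched as whole hyphen-delimited components of the task name.
TASK_FAMILY_PRESET = {
    "m-arena-hard": DEFAULT_JUDGE_PROMPT_PRESET,
    "fluency": FLUENCY_PRESET,
}


def resolve_task_preset(task: str) -> str:
    """Return the default preset for a task, or raise if the task is unknown."""
    if task in TASK_DEFAULT_PRESET:
        return TASK_DEFAULT_PRESET[task]
    # Pad with hyphens so a family only matches on component boundaries,
    # e.g. "en-fluency" matches "fluency" but "fluencyx" does not.
    padded = f"-{task}-"
    for family, preset in TASK_FAMILY_PRESET.items():
        if f"-{family}-" in padded:
            return preset
    raise ValueError(f"No default judge prompt preset for task {task!r}")
```

With this shape, adding a new task family is a one-line change to the mapping, and a misspelled task name surfaces as an error instead of silently falling through.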

)
_append_results(swapped_judgments, swapped_prompt_kwargs, swapped=True)

return pd.Series(preferences, dtype=float), annotations, metadata, 0


This is a bit confusing: the return type is annotated as None, but we return a tuple with 4 elements?

    judge_chat_model,
    resolved_prompt: ResolvedJudgePrompt,
) -> pd.Series:
    prefs, annotations, combined_metadata, _num_inconsistent = (


We return 0 in judge_mt_bench_with_preset, and even if we did not, we never use the _num_inconsistent variable. We should either discard it or use it properly.
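One possible way to discard it, sketched under assumptions: return a small named result type with only the fields callers actually use, which would also make the return annotation match what is returned. JudgeResult, judge_with_preset_sketch, and preferences_only are hypothetical names, and plain lists stand in for pd.Series here just to keep the example dependency-free.

```python
from typing import Any, NamedTuple


class JudgeResult(NamedTuple):
    """Hypothetical result type: only the fields that callers consume."""

    preferences: list[float]
    annotations: list[dict[str, Any]]
    metadata: list[dict[str, Any]]


def judge_with_preset_sketch(raw_scores: list[float]) -> JudgeResult:
    # Stand-in for the real judging logic: build one annotation and one
    # metadata record per score, with no unused inconsistency counter.
    annotations = [{"score": score} for score in raw_scores]
    metadata = [{"swapped": False} for _ in raw_scores]
    return JudgeResult(list(raw_scores), annotations, metadata)


def preferences_only(raw_scores: list[float]) -> list[float]:
    # Callers read fields by name instead of unpacking a positional tuple.
    return judge_with_preset_sketch(raw_scores).preferences
```

If the inconsistency count is meant to be reported later, it could be added back as a fourth named field once it carries a real value.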


2 participants