Add judge-prompt registry with per-task defaults by alexrs-cohere · Pull Request #40 · OpenEuroLLM/JudgeArena

alexrs-cohere · 2026-04-29T09:53:38Z

Summary

Introduces a small named registry of judge prompts in judgearena.prompts.registry so every benchmark gets a sensible default that is also recorded by hash in the run metadata. Four bundled presets:

default — score-only system + user pair (used by alpaca-eval, arena-hard-v0.1, arena-hard-v2.0, m-arena-hard*).
default_with_explanation — equivalent to today's --provide_explanation.
fluency — the inline fluency system prompt that previously lived inside generate_and_evaluate.py, paired with the default user template.
fastchat-pairwise — name-only delegate for the MT-Bench pipeline, whose category-aware prompts continue to be selected by fastchat_compat._select_prompt.

TASK_DEFAULT_PRESET plus the prefix-aware default_preset_for_task decide which preset a task gets when the user does not pass --judge_prompt_preset.
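
As a rough illustration of what "prefix-aware" could mean here, the sketch below tries an exact match first and then falls back to the longest matching key prefix. The mapping contents, the `mt-bench` and `fluency` keys, and the `fallback` parameter are assumptions for illustration; the real table in judgearena.prompts.registry may differ.

```python
# Hypothetical contents: the PR text names the presets but not the exact mapping keys.
TASK_DEFAULT_PRESET = {
    "alpaca-eval": "default",
    "arena-hard-v0.1": "default",
    "arena-hard-v2.0": "default",
    "m-arena-hard": "default",        # acts as a prefix for m-arena-hard* variants
    "fluency": "fluency",             # assumed prefix for fluency tasks
    "mt-bench": "fastchat-pairwise",  # assumed task name for the MT-Bench pipeline
}


def default_preset_for_task(task: str, fallback: str = "default") -> str:
    """Return the preset for `task`: exact match first, then longest matching prefix."""
    if task in TASK_DEFAULT_PRESET:
        return TASK_DEFAULT_PRESET[task]
    # Collect every key that is a prefix of the task name and keep the longest one.
    prefixes = [key for key in TASK_DEFAULT_PRESET if task.startswith(key)]
    if prefixes:
        return TASK_DEFAULT_PRESET[max(prefixes, key=len)]
    return fallback
```

Under these assumptions, a task named `m-arena-hard-EU` resolves to `default`, and a task whose name starts with `fluency` resolves to `fluency`.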

Three new CLI flags select the prompt:

--judge_prompt_preset NAME — pick a registered preset.
--judge_system_prompt_file / --judge_user_prompt_file — fully custom prompts; both file overrides take precedence over the preset when set.

--provide_explanation is preserved and is now equivalent to --judge_prompt_preset default_with_explanation.
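
The precedence rules above can be sketched roughly as follows. This is not the registry code: `PromptArgs`, `pick_prompts`, and the placeholder preset texts are hypothetical. Two behaviours are assumptions rather than statements from this PR: that an explicit preset wins over --provide_explanation when both are given, and that each file override applies independently.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple

# Placeholder preset texts for illustration only; the real templates live on disk.
PRESETS = {
    "default": ("SYSTEM: score only", "USER: default template"),
    "default_with_explanation": ("SYSTEM: score and explain", "USER: default template"),
}


@dataclass
class PromptArgs:
    """Hypothetical stand-in for the prompt-related CLI fields."""
    judge_prompt_preset: Optional[str] = None
    judge_system_prompt_file: Optional[str] = None
    judge_user_prompt_file: Optional[str] = None
    provide_explanation: bool = False


def pick_prompts(args: PromptArgs, task_default: str = "default") -> Tuple[str, str]:
    # 1. Choose a preset name: explicit flag, then the legacy alias, then the task default.
    if args.judge_prompt_preset is not None:
        name = args.judge_prompt_preset
    elif args.provide_explanation:
        name = "default_with_explanation"
    else:
        name = task_default
    if name not in PRESETS:
        raise ValueError(f"unknown judge prompt preset: {name!r}")
    system, user = PRESETS[name]
    # 2. File overrides win over whatever the preset supplied.
    if args.judge_system_prompt_file:
        system = Path(args.judge_system_prompt_file).read_text(encoding="utf-8")
    if args.judge_user_prompt_file:
        user = Path(args.judge_user_prompt_file).read_text(encoding="utf-8")
    return system, user
```

With this shape, passing only a system prompt file replaces the system prompt while the user template still comes from the selected preset.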

judgearena/evaluate.py::resolve_judge_prompts becomes a thin backward-compatible shim around the registry, and a new resolve_run_judge_prompt(task, cli_args) returns the full ResolvedJudgePrompt (including SHAs/paths) so follow-up PRs can fold that hash into cache keys and run metadata.
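
For readers who want a feel for what a resolved prompt with SHAs and paths might carry, here is a minimal sketch. The class name `ResolvedJudgePromptSketch` and all of its fields are assumptions chosen for illustration; the actual ResolvedJudgePrompt in this PR may be shaped differently.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedJudgePromptSketch:
    """Illustrative only: field names are assumed, not taken from the PR diff."""
    preset_name: str
    system_prompt: str
    user_prompt_template: str
    system_prompt_path: Optional[str] = None  # set when the text came from a file
    user_prompt_path: Optional[str] = None

    @staticmethod
    def _sha256(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    @property
    def system_sha256(self) -> str:
        return self._sha256(self.system_prompt)

    @property
    def user_sha256(self) -> str:
        return self._sha256(self.user_prompt_template)

    @property
    def combined_sha256(self) -> str:
        # A NUL separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
        return self._sha256(self.system_prompt + "\0" + self.user_prompt_template)
```

Hashing the verbatim prompt text, rather than the preset name, means a file override or an edited template changes the recorded hash even when the preset name stays the same.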

The CLI dispatcher forwards the three new fields through to both CliArgs and CliEloArgs. generate_and_evaluate.py is intentionally left unchanged in this PR: its existing resolve_judge_prompts call goes through the new shim and continues to behave exactly as before — a follow-up PR will switch the call site to the task-aware resolve_run_judge_prompt and drop the now-redundant inline fluency system prompt.

Why

Today the judge prompt is hard-coded to two on-disk templates and the fluency variant lives as an inline system_prompt = """...""" string in generate_and_evaluate.py. There is no way to:

record which prompt a run actually used (the metadata bundle stores only SHA-256 hashes after-the-fact);
swap prompts per task without editing the library;
carry a reproducible reference into a re-run.

This PR is the smallest piece of plumbing that gives us a named registry + per-task default mapping + CLI surface; subsequent PRs in the reproducibility-hardening stack will fold the resolved prompt SHA into cache keys and write the verbatim templates next to each run.
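
To make the "fold the resolved prompt SHA into cache keys" idea concrete, here is one possible sketch. The function name `judge_cache_key` and the choice of key fields are hypothetical; the follow-up PRs may build their cache keys differently.

```python
import hashlib
import json


def judge_cache_key(judge_model: str, instruction: str, prompt_sha256: str) -> str:
    """Hypothetical cache key that changes whenever the judge prompt changes."""
    payload = json.dumps(
        {
            "judge_model": judge_model,
            "instruction": instruction,
            "judge_prompt_sha256": prompt_sha256,
        },
        sort_keys=True,  # stable field order so equal inputs give equal keys
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The point of including the prompt hash is that a cached judgement produced under one prompt is never reused for a run that resolved a different prompt.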

Test plan

uv run pytest -q
tests/test_prompt_registry.py covers task defaults, explicit presets, the provide_explanation alias, file overrides winning over presets, the fastchat-pairwise delegate, and error paths.
tests/test_cli.py::test_judge_prompt_preset_flag_is_forwarded verifies the new flag survives CLI dispatch.
CI runs the full suite.

Introduces a small named registry of judge prompts in judgearena.prompts.registry so every benchmark gets a sensible default that is also recorded by hash in the run metadata. The four bundled presets are:

- ``default`` -- score-only system + user pair (used by alpaca-eval, arena-hard-v0.1/v2.0, m-arena-hard*, fluency tasks fall through to ``fluency``).
- ``default_with_explanation`` -- equivalent to today's ``--provide_explanation``.
- ``fluency`` -- the inline fluency system prompt that previously lived inside ``generate_and_evaluate.py``, paired with the default user template.
- ``fastchat-pairwise`` -- name-only delegate for the MT-Bench pipeline, whose category-aware prompts continue to be selected by ``fastchat_compat._select_prompt``.

``TASK_DEFAULT_PRESET`` plus the prefix-aware ``default_preset_for_task`` decide which preset a task gets when the user does not pass ``--judge_prompt_preset``.

Three new CLI flags select the prompt:

- ``--judge_prompt_preset NAME`` -- pick a registered preset.
- ``--judge_system_prompt_file`` / ``--judge_user_prompt_file`` -- fully custom prompts; both file overrides take precedence over the preset when set.

``--provide_explanation`` is preserved and is now equivalent to ``--judge_prompt_preset default_with_explanation``.

``judgearena/evaluate.py``::``resolve_judge_prompts`` becomes a thin backward-compatible shim around the registry, and a new ``resolve_run_judge_prompt(task, cli_args)`` returns the full ``ResolvedJudgePrompt`` (including SHAs/paths) so future PRs can fold that hash into cache keys and run metadata.

The CLI dispatcher forwards the three new fields through to both ``CliArgs`` and ``CliEloArgs``. ``generate_and_evaluate.py`` is left unchanged: its existing ``resolve_judge_prompts`` call goes through the new shim and continues to behave exactly as before.

Tests cover the registry resolution rules (task defaults, preset by name, file overrides, ``provide_explanation`` alias, ``fastchat-pairwise`` delegation) plus a CLI test that confirms ``--judge_prompt_preset`` is parsed and forwarded.

Made-with: Cursor

ErlisLushtaku · 2026-04-30T13:28:52Z

    )


+def resolve_run_judge_prompt(task: str, cli_args) -> ResolvedJudgePrompt:


This looks like the intended single resolution path for the new CLI prompt options, but I can only find it referenced from the new unit test, not from production code.

Yes! I'm working on a PR on top of this one that uses this across multiple files (generate_and_evaluate.py, estimate_elo_ratings.py, generate.py).

ErlisLushtaku · 2026-04-30T13:30:50Z

PR #32 also rewrites this area, but it threads a richer resolved-prompt object with parser-mode metadata through the judging pipeline. This PR adds a second prompt-resolution abstraction in the same file. Can we unify on one resolver and one return type before merging? Otherwise this will conflict both textually and semantically with #32.

You have more context on the codebase so I'll let you decide what's the best approach!

To provide some more context, the follow-up PR I'm working on passes the prompt and all generation and sampling parameters to the different models (ChatVLLM, ChatOpenAI, etc.) and writes them to the metadata.

ErlisLushtaku · 2026-05-22T08:56:45Z

@alexrs-cohere could you please review this PR, which is the extracted part related to judge presets that I used for the paper submission? I also have a sequence of cascading PRs on top of it (see the numbered PRs). Feel free to suggest changes, as you have probably thought about this more.

Let me know how you think we should proceed with merging these two PRs, if you think both are needed. We could either merge this PR into that PR and propagate the changes to the cascading PRs on top, or we could merge those PRs first and then implement the necessary changes from this one at the end. I think the second option would be a bit easier, as it requires only one rebase.

geoalgo · 2026-05-26T13:19:33Z

We plan to merge #48 first, because otherwise we may face many conflicts and rebases.

Will close this one once we have made sure that we support the same features.

ErlisLushtaku · 2026-05-27T21:31:04Z

Implemented the features of this PR (and the wiring) in #48.

geoalgo requested a review from ErlisLushtaku April 29, 2026 16:21

ErlisLushtaku reviewed Apr 30, 2026

ErlisLushtaku closed this May 27, 2026

alexrs-cohere wants to merge 1 commit into OpenEuroLLM:main from alexrs-cohere:feat/judge-prompt-registry
