Add judge-prompt registry with per-task defaults #40
Introduces a small named registry of judge prompts in judgearena.prompts.registry so every benchmark gets a sensible default that is also recorded by hash in the run metadata.

The four bundled presets are:
- ``default`` -- score-only system + user pair (used by alpaca-eval, arena-hard-v0.1/v2.0, and m-arena-hard*; fluency tasks fall through to ``fluency``).
- ``default_with_explanation`` -- equivalent to today's ``--provide_explanation``.
- ``fluency`` -- the inline fluency system prompt that previously lived inside ``generate_and_evaluate.py``, paired with the default user template.
- ``fastchat-pairwise`` -- name-only delegate for the MT-Bench pipeline, whose category-aware prompts continue to be selected by ``fastchat_compat._select_prompt``.

``TASK_DEFAULT_PRESET`` plus the prefix-aware ``default_preset_for_task`` decide which preset a task gets when the user does not pass ``--judge_prompt_preset``.

Three new CLI flags select the prompt:
- ``--judge_prompt_preset NAME`` -- pick a registered preset.
- ``--judge_system_prompt_file`` / ``--judge_user_prompt_file`` -- fully custom prompts; both file overrides take precedence over the preset when set.

``--provide_explanation`` is preserved and is now equivalent to ``--judge_prompt_preset default_with_explanation``.

``judgearena/evaluate.py``::``resolve_judge_prompts`` becomes a thin backward-compatible shim around the registry, and a new ``resolve_run_judge_prompt(task, cli_args)`` returns the full ``ResolvedJudgePrompt`` (including SHAs/paths) so future PRs can fold that hash into cache keys and run metadata.

The CLI dispatcher forwards the three new fields through to both ``CliArgs`` and ``CliEloArgs``. ``generate_and_evaluate.py`` is left unchanged: its existing ``resolve_judge_prompts`` call goes through the new shim and continues to behave exactly as before.

Tests cover the registry resolution rules (task defaults, preset by name, file overrides, ``provide_explanation`` alias, ``fastchat-pairwise`` delegation) plus a CLI test that confirms ``--judge_prompt_preset`` is parsed and forwarded.

Made-with: Cursor
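The resolution rules described in the commit message can be sketched roughly as follows. This is an illustrative sketch, not the code from judgearena.prompts.registry: the ``Preset`` class, the placeholder prompt texts, the task names for fluency, the trailing ``*`` convention for prefix keys, and the ``resolve`` helper are all assumptions made for the example. It also assumes that an explicit preset wins over ``provide_explanation``, that each override replaces only its own half of the prompt pair, and it passes override text directly instead of reading files. The ``fastchat-pairwise`` delegate is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preset:
    """Hypothetical container for one named judge prompt pair."""
    name: str
    system: str
    user: str


# Placeholder texts only; the real templates live in the repository.
PRESETS = {
    "default": Preset("default", "<score-only system>", "<default user>"),
    "default_with_explanation": Preset(
        "default_with_explanation", "<system with explanation>", "<default user>"
    ),
    "fluency": Preset("fluency", "<fluency system>", "<default user>"),
}

# Assumed shape: exact task names, plus keys ending in "*" acting as prefixes.
TASK_DEFAULT_PRESET = {
    "alpaca-eval": "default",
    "arena-hard-v0.1": "default",
    "arena-hard-v2.0": "default",
    "m-arena-hard*": "default",
    "fluency*": "fluency",  # hypothetical task prefix
}


def default_preset_for_task(task: str) -> str:
    """Exact match first, then the longest matching prefix, else 'default'."""
    if task in TASK_DEFAULT_PRESET:
        return TASK_DEFAULT_PRESET[task]
    prefixes = [key[:-1] for key in TASK_DEFAULT_PRESET if key.endswith("*")]
    matches = [prefix for prefix in prefixes if task.startswith(prefix)]
    if matches:
        return TASK_DEFAULT_PRESET[max(matches, key=len) + "*"]
    return "default"


def resolve(
    task: str,
    preset: Optional[str] = None,
    provide_explanation: bool = False,
    system_override: Optional[str] = None,
    user_override: Optional[str] = None,
) -> Preset:
    """Pick a preset, then let any override replace its half of the pair."""
    if preset is None and provide_explanation:
        preset = "default_with_explanation"  # legacy flag acts as an alias
    name = preset if preset is not None else default_preset_for_task(task)
    if name not in PRESETS:
        raise KeyError(f"unknown judge prompt preset: {name!r}")
    base = PRESETS[name]
    return Preset(
        name,
        base.system if system_override is None else system_override,
        base.user if user_override is None else user_override,
    )
```

Under these assumptions, ``resolve("alpaca-eval", provide_explanation=True)`` selects ``default_with_explanation``, while a task matching the hypothetical ``fluency*`` prefix falls through to ``fluency`` when no preset is passed.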
def resolve_run_judge_prompt(task: str, cli_args) -> ResolvedJudgePrompt:
This looks like the intended single resolution path for the new CLI prompt options, but I can only find it referenced from the new unit test, not from production code.
Yes! I'm working on a PR on top of this one that uses this across multiple files (generate_and_evaluate.py, estimate_elo_ratings.py, generate.py).
PR #32 also rewrites this area, but it threads a richer resolved-prompt object with parser-mode metadata through the judging pipeline. This PR adds a second prompt-resolution abstraction in the same file. Can we unify on one resolver and one return type before merging? Otherwise this will conflict both textually and semantically with #32.
You have more context on the codebase so I'll let you decide what's the best approach!
To provide some more context, the follow-up PR I'm working on passes the prompt and all the generation and sampling parameters to the different models (ChatVLLM, ChatOpenAI, etc.) and writes them to the metadata.
@alexrs-cohere could you please review this PR? It is the extracted part related to judge presets that I used for the paper submission. I also have a sequence of cascading PRs on top of it (see the numbered PRs). Feel free to suggest changes, since you have probably thought about this more. Let me know what you think is the best way to proceed with merging these two PRs, if you think both are needed. We could either merge this PR into that PR and propagate the changes to the cascading PRs on top, or we could merge those PRs first and then implement the necessary changes from this one at the end. I think the second option would be a bit easier, as it requires only one rebase.
We plan to merge #48 first because otherwise we may have many conflicts and rebases. We will close this one once we have made sure that we support the same features.
Implemented the features of this PR (and the wiring) in #48.
Summary
Introduces a small named registry of judge prompts in ``judgearena.prompts.registry`` so every benchmark gets a sensible default that is also recorded by hash in the run metadata. Four bundled presets:
- ``default`` -- score-only system + user pair (used by ``alpaca-eval``, ``arena-hard-v0.1``, ``arena-hard-v2.0``, ``m-arena-hard*``).
- ``default_with_explanation`` -- equivalent to today's ``--provide_explanation``.
- ``fluency`` -- the inline fluency system prompt that previously lived inside ``generate_and_evaluate.py``, paired with the default user template.
- ``fastchat-pairwise`` -- name-only delegate for the MT-Bench pipeline, whose category-aware prompts continue to be selected by ``fastchat_compat._select_prompt``.

``TASK_DEFAULT_PRESET`` plus the prefix-aware ``default_preset_for_task`` decide which preset a task gets when the user does not pass ``--judge_prompt_preset``.

Three new CLI flags select the prompt:
- ``--judge_prompt_preset NAME`` -- pick a registered preset.
- ``--judge_system_prompt_file`` / ``--judge_user_prompt_file`` -- fully custom prompts; both file overrides take precedence over the preset when set.

``--provide_explanation`` is preserved and is now equivalent to ``--judge_prompt_preset default_with_explanation``.

``judgearena/evaluate.py::resolve_judge_prompts`` becomes a thin backward-compatible shim around the registry, and a new ``resolve_run_judge_prompt(task, cli_args)`` returns the full ``ResolvedJudgePrompt`` (including SHAs/paths) so follow-up PRs can fold that hash into cache keys and run metadata.

The CLI dispatcher forwards the three new fields through to both ``CliArgs`` and ``CliEloArgs``. ``generate_and_evaluate.py`` is intentionally left unchanged in this PR: its existing ``resolve_judge_prompts`` call goes through the new shim and continues to behave exactly as before. A follow-up PR will switch the call site to the task-aware ``resolve_run_judge_prompt`` and drop the now-redundant inline fluency system prompt.

Why
Today the judge prompt is hard-coded to two on-disk templates and the fluency variant lives as an inline ``system_prompt = """..."""`` string in ``generate_and_evaluate.py``. There is no way to:
only SHA-256 hashes after-the-fact);
This PR is the smallest piece of plumbing that gives us a named registry + per-task default mapping + CLI surface; subsequent PRs in the reproducibility-hardening stack will fold the resolved prompt SHA into cache keys and write the verbatim templates next to each run.
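As a rough illustration of what "folding the resolved prompt SHA into cache keys" could look like, here is a minimal sketch using SHA-256 from the standard library. The function names, the cache-key fields, and the length-prefix scheme are assumptions for the example; the actual hashing in ``ResolvedJudgePrompt`` may differ.

```python
import hashlib
import json


def prompt_sha256(system_prompt: str, user_template: str) -> str:
    """Hash the prompt pair; hypothetical helper, not the repository's API."""
    digest = hashlib.sha256()
    for part in (system_prompt, user_template):
        data = part.encode("utf-8")
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def cache_key(model: str, instruction: str, prompt_hash: str) -> str:
    """Derive a cache key that changes whenever the judge prompt changes."""
    payload = json.dumps(
        {
            "model": model,
            "instruction": instruction,
            "judge_prompt_sha256": prompt_hash,
        },
        sort_keys=True,  # stable field order gives a stable key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a scheme like this, editing either template yields a different cache key, so judgments cached under an older prompt would not be reused by accident.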
Test plan
- ``uv run pytest -q``
- ``tests/test_prompt_registry.py`` covers task defaults, explicit presets, the ``provide_explanation`` alias, file overrides winning over presets, the ``fastchat-pairwise`` delegate, and error paths.
- ``tests/test_cli.py::test_judge_prompt_preset_flag_is_forwarded`` verifies the new flag survives CLI dispatch.
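For readers unfamiliar with this kind of forwarding test, the check could look roughly like the sketch below. It is not the repository's test: ``DemoCliArgs``, ``build_parser``, ``parse_cli``, and the ``--task`` flag are hypothetical stand-ins for the real ``CliArgs`` and dispatcher; only the four judge-prompt flag names come from the PR description.

```python
import argparse
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DemoCliArgs:
    """Hypothetical stand-in for the real CliArgs dataclass."""
    task: str
    judge_prompt_preset: Optional[str] = None
    judge_system_prompt_file: Optional[str] = None
    judge_user_prompt_file: Optional[str] = None
    provide_explanation: bool = False


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--task", required=True)  # assumed flag name
    parser.add_argument("--judge_prompt_preset", default=None)
    parser.add_argument("--judge_system_prompt_file", default=None)
    parser.add_argument("--judge_user_prompt_file", default=None)
    parser.add_argument("--provide_explanation", action="store_true")
    return parser


def parse_cli(argv: Sequence[str]) -> DemoCliArgs:
    """Parse argv and forward every parsed field into the args dataclass."""
    namespace = build_parser().parse_args(argv)
    return DemoCliArgs(**vars(namespace))


def test_judge_prompt_preset_flag_is_forwarded() -> None:
    args = parse_cli(["--task", "alpaca-eval", "--judge_prompt_preset", "fluency"])
    assert args.judge_prompt_preset == "fluency"
    assert args.judge_system_prompt_file is None
    assert args.provide_explanation is False
```

The point of such a test is that a flag can be parsed correctly yet still be dropped before it reaches the args object, so the assertion is made on the forwarded dataclass and not on the raw namespace.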