2. Add prompt presets by ErlisLushtaku · Pull Request #54 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-05-28T08:58:24Z

Summary

Move shared judge prompt preset resolution into its own stack layer.
Wire prompt preset selection through CLI, pairwise judging, and MT-Bench preset judging.

Stack

Next: pr32-split-v2/02.5-mt-bench-preset-judging.

ErlisLushtaku · 2026-05-28T09:04:15Z

@kargibora this is the same as #48; I just split some of the mt-bench logic into a separate PR, #55. If you want, you can give it another review, but it should be the same as what you saw before. #55 is new, though.

kargibora

Thanks for the clean-up! I left some more comments after reading it thoroughly.

kargibora · 2026-06-08T09:28:45Z

+            provide_explanation=provide_explanation,
+        )
+
+    if preset is None:


If preset=None, we set it to DEFAULT_WITH_EXPLANATION_PRESET when provide_explanation is set, so task and provide_explanation cannot be used at the same time. Is this intended?
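A minimal sketch of one way to let task and provide_explanation compose, in case the current behavior is not intended. The constant values, the `PresetChoice` type, and the `choose_preset` helper are hypothetical stand-ins, not the code in this PR; the idea is only that the explanation flag travels alongside the preset name instead of selecting it.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical values; the real constants are defined in the PR.
DEFAULT_JUDGE_PROMPT_PRESET = "default"
TASK_DEFAULT_PRESET: Dict[str, str] = {"mt-bench": "mt_bench"}


@dataclass(frozen=True)
class PresetChoice:
    """Preset name plus the explanation flag, kept as independent fields."""

    name: str
    provide_explanation: bool


def choose_preset(
    preset: Optional[str],
    task: Optional[str],
    provide_explanation: bool,
) -> PresetChoice:
    if preset is None:
        # Fall back to the task default, then to the global default.
        # provide_explanation does not influence which preset is picked.
        preset = TASK_DEFAULT_PRESET.get(task or "", DEFAULT_JUDGE_PROMPT_PRESET)
    return PresetChoice(name=preset, provide_explanation=provide_explanation)
```

With this shape, `choose_preset(None, "mt-bench", True)` keeps the task default and still carries the explanation request through to prompt rendering.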

kargibora · 2026-06-08T09:33:05Z

+        return TASK_DEFAULT_PRESET[task]
+    if task.startswith("m-arena-hard"):
+        return DEFAULT_JUDGE_PROMPT_PRESET
+    if task.startswith("fluency"):


I think checking task names for these specific keywords is not optimal. For example, a task name ending in `-fluency` would fail the startswith check. We could use asserts and a task-resolving structure for this instead.
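A rough sketch of what such a task-resolving structure could look like. The pattern table, the family names, and `resolve_task_family` are assumptions made for illustration and do not exist in the PR; the sketch raises `ValueError` rather than using `assert` so the check survives `python -O`.

```python
import re
from typing import Dict, Pattern

# Hypothetical registry: one anchored pattern per task family.
TASK_FAMILY_PATTERNS: Dict[str, Pattern[str]] = {
    "m-arena-hard": re.compile(r"m-arena-hard(-.+)?"),
    "fluency": re.compile(r"(.+-)?fluency(-.+)?"),
}


def resolve_task_family(task: str) -> str:
    """Return the single family whose pattern matches the whole task name."""
    matches = [
        family
        for family, pattern in TASK_FAMILY_PATTERNS.items()
        if pattern.fullmatch(task)
    ]
    if len(matches) != 1:
        # Unknown or ambiguous names fail loudly instead of falling through
        # to a default preset.
        raise ValueError(f"Cannot resolve task family for {task!r}: {matches}")
    return matches[0]
```

Resolving the family once and then looking up the default preset by family keeps the string matching in a single place.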

kargibora · 2026-06-08T10:02:41Z

+        )
+        _append_results(swapped_judgments, swapped_prompt_kwargs, swapped=True)
+
+    return pd.Series(preferences, dtype=float), annotations, metadata, 0


This is a bit confusing: the declared return type is None, but we return a tuple with 4 elements?
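One option, sketched under assumptions: give the four-element result a named type so the annotation and the return statement agree. `PresetJudgingResult` and the stub function are hypothetical, and plain lists stand in for the `pd.Series` the real function returns so the example stays self-contained.

```python
from typing import Any, Dict, List, NamedTuple


class PresetJudgingResult(NamedTuple):
    """Hypothetical named result mirroring the four returned values."""

    preferences: List[float]  # stands in for the pd.Series in the PR
    annotations: List[Dict[str, Any]]
    metadata: List[Dict[str, Any]]
    num_inconsistent: int


def judge_with_preset_stub(scores: List[float]) -> PresetJudgingResult:
    """Stub that only demonstrates the annotated return shape."""
    annotations = [{"score": score} for score in scores]
    metadata = [{"swapped": False} for _ in scores]
    return PresetJudgingResult(list(scores), annotations, metadata, 0)
```

A `NamedTuple` still unpacks positionally, so existing callers that unpack four values would keep working.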

kargibora · 2026-06-08T10:03:39Z

+    judge_chat_model,
+    resolved_prompt: ResolvedJudgePrompt,
+) -> pd.Series:
+    prefs, annotations, combined_metadata, _num_inconsistent = (


judge_mt_bench_with_preset always returns 0 for this value, and even if it did not, we never use the _num_inconsistent variable. We should either discard it or use it properly.
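If the value is kept, it could be computed rather than hard-coded. The sketch below is an assumption about what "inconsistent" means, not the PR's definition: preferences are taken to lie in [0, 1], the swapped-order preference is taken to be expressed in the swapped frame, and a pair counts as consistent when the two mirror each other. `count_inconsistent` is a hypothetical helper.

```python
from typing import Sequence


def count_inconsistent(
    original: Sequence[float],
    swapped: Sequence[float],
    tol: float = 1e-9,
) -> int:
    """Count pairs whose swapped-order preference does not mirror the original."""
    if len(original) != len(swapped):
        raise ValueError("original and swapped must have the same length")
    # Under the stated assumption, a consistent judge gives p for the original
    # order and 1 - p for the swapped order.
    return sum(
        1 for p, q in zip(original, swapped) if abs(p - (1.0 - q)) > tol
    )
```

If the number is not needed, dropping it from the return value and from the unpacking at the call site is the simpler fix.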

ErlisLushtaku added 10 commits May 19, 2026 00:31

Add native baselines and judge controls

d0f4604

Fix versioned m-Arena-Hard download test

092c0e8

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split. Includes-AI-Code: true

remove backward compatibility code

28c21b8

revert

2213b68

Add prompt presets and localized judge prompts

6112785

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top. Includes-AI-Code: true

skywork prompt preset

366c911

Merge remote-tracking branch 'origin/main' into pr32-split-v2/02-prom…

dfa3231

…pt-presets-localized

Add clarifying comment about localized presets

3280534

remove localized and skywork presets

b1b3241

Merge pr 40 logic, add registry and task defaults

ec06ad3

ErlisLushtaku mentioned this pull request May 28, 2026

2. Add prompt presets and mt-bench changes #48

Closed

ErlisLushtaku requested a review from kargibora May 28, 2026 09:04

kargibora reviewed Jun 8, 2026

ErlisLushtaku wants to merge 10 commits into main from pr32-split-v2/02-prompt-presets-only
