3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf by ErlisLushtaku · Pull Request #57 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-06-01T14:27:36Z

Summary

Runtime, CLI, and orchestration follow-ups split out of the closed PR #52. This intentionally excludes truncation/limit tracking, generation cache-key hashing, skip_judging, and the MT-Bench-specific changes (now in PR #55).

Changes

Runtime / model robustness (utils.py)
- _init_llm_with_retry: retry vLLM construction on transient GPU-init signatures (e.g. cudaErrorDevicesUnavailable) with bounded backoff; re-raise everything else immediately.
- make_model: strip all vLLM-only engine kwargs (max_model_len, gpu_memory_utilization, tensor_parallel_size, kv_cache_dtype, …) for non-vLLM providers so remote APIs don't reject them.
- ChatVLLM: configurable temperature/top_p with set_temperature, exposed tokenizer, and chat_template_kwargs plumbing.

Metadata propagation bugfix (utils.py)
- _extract_ai_message_metadata + do_inference(return_metadata=...) and ChatVLLM.batch_with_metadata so finish_reason/stop_reason are preserved on both the async (use_tqdm) and batch paths.
- Optional JUDGEARENA_JUDGE_MAX_CONCURRENCY cap on in-flight async invokes (unset = unbounded, prior behavior).

Judge engine defaults (utils.py, generate_and_evaluate.py)
- build_default_judge_model_kwargs centralizes judge engine kwargs and adds an FP8 KV-cache default for FP8 judges; _build_judge_engine_kwargs now delegates to it.

CLI compatibility (cli.py)
- Restore the deprecated --model alias, collapsed into --model_A via _resolve_model_a (errors if both are given).

Generation / evaluation orchestration (generate_and_evaluate.py, evaluate.py)
- Replace functools.partial generation wiring with a single _run_generation helper; preserve the fluency instruction index; pass an explicit parser_mode into judge_and_parse_prefs.

Dataset/download maintenance (utils.py)
- download_hf points at judge-arena/judge-arena-dataset; download_all is driven by M_ARENA_HARD_BASELINES.

Tests: metadata extraction, async/batch finish_reason propagation, vLLM init retry behavior, expanded make_model kwarg stripping, and baseline-plan resolution cases.
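
For readers skimming the description, the retry policy listed for _init_llm_with_retry (retry only on transient GPU-init signatures, bounded backoff, re-raise everything else) could look roughly like the sketch below. This is not the code in the diff: the helper name `init_with_retry`, the `factory` callable, the linear backoff schedule, and the 5-second default are illustrative assumptions. Only the two signature strings and the default of 4 attempts appear in this PR.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# The two signatures quoted in this PR; the real list may be longer.
TRANSIENT_SIGNATURES = (
    "cudaErrorDevicesUnavailable",
    "CUDA error: initialization error",
)


def init_with_retry(
    factory: Callable[[], T],
    max_attempts: int = 4,
    backoff_seconds: float = 5.0,  # assumed default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call factory(), retrying only when the error looks transient."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return factory()
        except Exception as exc:
            transient = any(sig in str(exc) for sig in TRANSIENT_SIGNATURES)
            # Non-transient errors and the final attempt propagate unchanged.
            if not transient or attempt == max_attempts:
                raise
            # Linear backoff (assumption): 1x, 2x, 3x the base delay.
            sleep(backoff_seconds * attempt)
    raise AssertionError("unreachable")  # keeps type checkers satisfied
```

Passing `sleep` as a parameter is only there so tests can run without waiting.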

Supersedes #56 (closed): branch renamed 02-prompt-presets-localized → 03.5-runtime-cli-followups and rebased onto the PR #55 tip so it forms a clean linear stack. Same single-commit diff (6 runtime files).

…up, point to judgearena hf
- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.

geoalgo · 2026-06-02T14:09:52Z

        "--model_B",
        help="Model B for generate+judge tasks (not yet supported for elo tasks).",
    )
+    parser.add_argument(


I am just skimming through this PR and I was wondering about this.
I see the point of trying to have another flag, but isn't it better in this case, since this is not really model_A (the positions are randomized) and the name may be confusing to users?

kargibora · 2026-06-08T10:37:21Z

    return f"{ELO_TASK_PREFIX}{lower_arena}"


+def _resolve_model_a(args: argparse.Namespace) -> str | None:


Creating an alias is a bit confusing to me. We should always be consistent; even if it breaks some existing code, there is no need to bring the old flag back. More arguments (although they add configurability) result in harder-to-follow code.
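
For context on what is being debated, an alias resolver of the kind the PR description mentions ("collapsed into --model_A ... errors if both are given") might look like the sketch below. The body is an assumption, not the diff's implementation; the function name `resolve_model_a`, the error type, and the message text are hypothetical.

```python
import argparse
from typing import Optional


def resolve_model_a(args: argparse.Namespace) -> Optional[str]:
    """Collapse a deprecated --model value into --model_A (hypothetical sketch)."""
    legacy = getattr(args, "model", None)
    current = getattr(args, "model_A", None)
    if legacy is not None and current is not None:
        # Refuse ambiguous input rather than silently picking one flag.
        raise ValueError("Pass either --model or --model_A, not both.")
    return current if current is not None else legacy


parser = argparse.ArgumentParser()
parser.add_argument("--model_A", default=None)
parser.add_argument("--model", default=None, help="Deprecated alias for --model_A.")

args = parser.parse_args(["--model", "my-model"])
resolved = resolve_model_a(args)
```

Dropping the alias, as suggested above, would remove both the extra argument and this helper.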

kargibora · 2026-06-08T10:39:02Z

@@ -191,9 +195,11 @@ def _resolve_baseline_plan(


 def _build_judge_engine_kwargs(args: CliArgs) -> dict[str, object]:


I know this is a mirror function but do we really need it?

kargibora · 2026-06-08T10:45:53Z

                    for chunk in chunks:
-                        results.extend(chat_model.batch(inputs=chunk, **invoke_kwargs))
-                    return results
+                        if return_metadata and hasattr(


We actually never pass return_metadata=True. We only pass it for the tests. Can you check this behaviour?

Also it would be nice to have some docstring for non-local functions.

kargibora · 2026-06-08T10:49:03Z

+    "CUDA error: initialization error",
+)
+_VLLM_INIT_MAX_ATTEMPTS = int(os.getenv("JUDGEARENA_VLLM_INIT_MAX_ATTEMPTS", "4"))
+_VLLM_INIT_BACKOFF_SECONDS = int(


If we set JUDGEARENA_VLLM_INIT_BACKOFF_SECONDS=" ", the entire CLI likely crashes, because this is evaluated at import time, before any functionality runs. Top-level code should never do this type of conversion without validating the input. (If the input is malformed, it is better to fall back to the default or to crash with a proper error message.)
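
One possible shape for the guarded conversion, as a minimal sketch: the helper name `env_int`, its `minimum` parameter, and the choice to warn and fall back (rather than fail) are assumptions, not part of the PR. The default of 4 attempts is the one shown in the hunk above.

```python
import logging
import os

logger = logging.getLogger(__name__)


def env_int(name: str, default: int, minimum: int = 0) -> int:
    """Read an integer env var, falling back to default on bad input."""
    raw = os.getenv(name)
    # Treat unset and blank values (e.g. " ") the same way.
    if raw is None or not raw.strip():
        return default
    try:
        value = int(raw)
    except ValueError:
        logger.warning("Ignoring %s=%r (not an integer); using %d", name, raw, default)
        return default
    if value < minimum:
        logger.warning("Ignoring %s=%d (below %d); using %d", name, value, minimum, default)
        return default
    return value


# Safe to evaluate at import time: malformed input can no longer raise here.
_VLLM_INIT_MAX_ATTEMPTS = env_int("JUDGEARENA_VLLM_INIT_MAX_ATTEMPTS", default=4, minimum=1)
```

If failing loudly is preferred, the `except` branch could raise an error that names the variable instead of returning the default.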

kargibora · 2026-06-08T10:50:16Z

+        # JUDGEARENA_JUDGE_MAX_CONCURRENCY caps simultaneous in-flight ainvokes
+        # (e.g. against OpenRouter). Unset = unbounded, preserving prior behaviour.
+        cap_raw = os.environ.get("JUDGEARENA_JUDGE_MAX_CONCURRENCY")
+        cap = int(cap_raw) if cap_raw and int(cap_raw) > 0 else None


Similar to the problem above: a ValueError from int(cap_raw) is not caught.
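
A sketch of how the cap could be parsed defensively and then applied with a semaphore. The helper names `read_concurrency_cap` and `gather_capped` are hypothetical, and raising a descriptive ValueError on malformed input is one option among several. Treating unset, blank, zero, or negative values as "unbounded" mirrors the hunk above.

```python
import asyncio
import os
from typing import Awaitable, Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def read_concurrency_cap(
    name: str = "JUDGEARENA_JUDGE_MAX_CONCURRENCY",
) -> Optional[int]:
    """Return a positive cap, or None (unbounded) when unset, blank, or <= 0."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return None
    try:
        cap = int(raw)
    except ValueError:
        # Fail with a message that names the variable instead of a bare traceback.
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    return cap if cap > 0 else None


async def gather_capped(
    worker: Callable[[T], Awaitable[R]],
    items: Sequence[T],
    cap: Optional[int],
) -> List[R]:
    """Run worker over items, with at most cap calls in flight at once."""
    semaphore = asyncio.Semaphore(cap) if cap else None

    async def run_one(item: T) -> R:
        if semaphore is None:
            return await worker(item)
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(run_one(item) for item in items)))
```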

ErlisLushtaku requested a review from kargibora June 1, 2026 14:29

ErlisLushtaku wants to merge 1 commit into pr32-split-v2/02.5-mt-bench-preset-judging from pr32-split-v2/03.5-runtime-cli-followups

3 participants
