3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf by ErlisLushtaku · Pull Request #56 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-06-01T13:57:05Z

Summary

Runtime, CLI, and orchestration follow-ups split out of the closed PR #52. This intentionally excludes truncation/limit tracking, generation cache-key hashing, skip_judging, and the MT-Bench-specific changes (now in PR #55).

Changes

Runtime / model robustness (utils.py)
- _init_llm_with_retry: retry vLLM construction on transient GPU-init signatures (e.g. cudaErrorDevicesUnavailable) with bounded backoff; reraise everything else immediately.
- make_model: strip all vLLM-only engine kwargs (max_model_len, gpu_memory_utilization, tensor_parallel_size, kv_cache_dtype, …) for non-vLLM providers so remote APIs don't reject them.
- ChatVLLM: configurable temperature/top_p with set_temperature, exposed tokenizer, and chat_template_kwargs plumbing.
Metadata propagation bugfix (utils.py)
- _extract_ai_message_metadata + do_inference(return_metadata=...) and ChatVLLM.batch_with_metadata so finish_reason/stop_reason are preserved on both the async (use_tqdm) and batch paths.
- Optional JUDGEARENA_JUDGE_MAX_CONCURRENCY cap on in-flight async invokes (unset = unbounded, prior behavior).
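
One way to picture the optional concurrency cap is a semaphore around each async invoke. The sketch below is an assumption about the shape of the mechanism, not the implementation in utils.py: read_max_concurrency and gather_capped are hypothetical names, and treating an empty value as unset and rejecting values below 1 are choices made only for this example.

```python
import asyncio
import os


def read_max_concurrency(env=os.environ):
    """Return the cap as a positive int, or None when the variable is unset."""
    raw = env.get("JUDGEARENA_JUDGE_MAX_CONCURRENCY")
    if not raw:
        return None  # unset (or empty, by assumption): unbounded, prior behavior
    value = int(raw)
    if value < 1:
        raise ValueError("JUDGEARENA_JUDGE_MAX_CONCURRENCY must be >= 1")
    return value


async def gather_capped(coro_fns, limit):
    """Run zero-argument coroutine functions, at most `limit` at a time."""
    if limit is None:
        return await asyncio.gather(*(fn() for fn in coro_fns))
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:  # blocks while `limit` invokes are already in flight
            return await fn()

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(run(fn) for fn in coro_fns))
```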
Judge engine defaults (utils.py, generate_and_evaluate.py)
- build_default_judge_model_kwargs centralizes judge engine kwargs and adds an FP8 KV-cache default for FP8 judges; _build_judge_engine_kwargs now delegates to it.
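
The FP8 KV-cache default could be expressed along these lines. This is a hedged sketch: default_judge_kwargs is a hypothetical stand-in for build_default_judge_model_kwargs, and detecting an FP8 judge by looking for "fp8" in the model name is an assumption, since the PR does not say how FP8 judges are recognized.

```python
def default_judge_kwargs(judge_model, overrides=None):
    """Return engine kwargs for a judge, adding an FP8 KV-cache default."""
    kwargs = dict(overrides or {})
    # Assumption: an FP8 judge is recognized by "fp8" in its model name.
    if "fp8" in judge_model.lower():
        # setdefault keeps an explicit user choice and only fills the gap.
        kwargs.setdefault("kv_cache_dtype", "fp8")
    return kwargs
```

Using setdefault means a caller who passes kv_cache_dtype explicitly is never overridden by the default.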
CLI compatibility (cli.py)
- Restore the deprecated --model alias, collapsed into --model_A via _resolve_model_a (errors if both are given).
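
A minimal sketch of how such an alias can be collapsed, assuming an argparse-based CLI. The functions build_parser and resolve_model_a are hypothetical and only mirror the behavior described for _resolve_model_a; the actual parser in cli.py may be structured differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_A", default=None)
    # Deprecated alias kept for backwards compatibility.
    parser.add_argument("--model", default=None, help="deprecated, use --model_A")
    return parser


def resolve_model_a(parser, args):
    """Collapse --model into --model_A, rejecting the ambiguous case."""
    if args.model is not None and args.model_A is not None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("use either --model or --model_A, not both")
    return args.model_A if args.model_A is not None else args.model
```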
Generation / evaluation orchestration (generate_and_evaluate.py, evaluate.py)
- Replace functools.partial generation wiring with a single _run_generation helper; preserve the fluency instruction index; pass an explicit parser_mode into judge_and_parse_prefs.
Dataset/download maintenance (utils.py)
- download_hf points at judge-arena/judge-arena-dataset; download_all is driven by M_ARENA_HARD_BASELINES.
Tests: metadata extraction, async/batch finish_reason propagation, vLLM init retry behavior, expanded make_model kwarg stripping, and baseline-plan resolution cases.
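
As an illustration of the make_model kwarg stripping that the tests cover, a sketch under stated assumptions: strip_vllm_kwargs is a hypothetical helper, the provider label "VLLM" is assumed, and the set below contains only the four kwargs named in this PR (the full list is elided there and is not reproduced here).

```python
# Kwargs named in this PR as vLLM-only; the real list is longer.
VLLM_ONLY_KWARGS = frozenset({
    "max_model_len",
    "gpu_memory_utilization",
    "tensor_parallel_size",
    "kv_cache_dtype",
})


def strip_vllm_kwargs(provider, kwargs):
    """Drop vLLM-only engine kwargs unless the provider is vLLM itself."""
    if provider == "VLLM":  # assumption: how the vLLM provider is labeled
        return dict(kwargs)
    # Remote APIs would reject these keys, so filter them out.
    return {k: v for k, v in kwargs.items() if k not in VLLM_ONLY_KWARGS}
```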

…up, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.

ErlisLushtaku changed the title ~~Add runtime robustness, judge-kwarg defaults, CLI/orchestration clean…~~ 3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf Jun 1, 2026

ErlisLushtaku requested a review from kargibora June 1, 2026 14:00

ErlisLushtaku closed this Jun 1, 2026

ErlisLushtaku deleted the pr32-split-v2/02-prompt-presets-localized branch June 1, 2026 14:26

ErlisLushtaku mentioned this pull request Jun 1, 2026

3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #57

Open

ErlisLushtaku removed the request for review from kargibora June 1, 2026 14:29

ErlisLushtaku wants to merge 1 commit into pr32-split-v2/02.5-mt-bench-preset-judging from pr32-split-v2/02-prompt-presets-localized
