3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #56

Closed
ErlisLushtaku wants to merge 1 commit into pr32-split-v2/02.5-mt-bench-preset-judging from pr32-split-v2/02-prompt-presets-localized

Conversation

@ErlisLushtaku

Collaborator

Summary

Runtime, CLI, and orchestration follow-ups split out of the closed PR #52. This intentionally excludes truncation/limit tracking, generation cache-key hashing, skip_judging, and the MT-Bench-specific changes (now in PR #55).

Changes

  • Runtime / model robustness (utils.py)
    • _init_llm_with_retry: retry vLLM construction on transient GPU-init signatures (e.g. cudaErrorDevicesUnavailable) with bounded backoff; reraise everything else immediately.
    • make_model: strip all vLLM-only engine kwargs (max_model_len, gpu_memory_utilization, tensor_parallel_size, kv_cache_dtype, …) for non-vLLM providers so remote APIs don't reject them.
    • ChatVLLM: configurable temperature/top_p with set_temperature, exposed tokenizer, and chat_template_kwargs plumbing.
  • Metadata propagation bugfix (utils.py)
    • _extract_ai_message_metadata + do_inference(return_metadata=...) and ChatVLLM.batch_with_metadata so finish_reason/stop_reason are preserved on both the async (use_tqdm) and batch paths.
    • Optional JUDGEARENA_JUDGE_MAX_CONCURRENCY cap on in-flight async invokes (unset = unbounded, prior behavior).
  • Judge engine defaults (utils.py, generate_and_evaluate.py)
    • build_default_judge_model_kwargs centralizes judge engine kwargs and adds an FP8 KV-cache default for FP8 judges; _build_judge_engine_kwargs now delegates to it.
  • CLI compatibility (cli.py)
    • Restore the deprecated --model alias, collapsed into --model_A via _resolve_model_a (errors if both are given).
  • Generation / evaluation orchestration (generate_and_evaluate.py, evaluate.py)
    • Replace functools.partial generation wiring with a single _run_generation helper; preserve the fluency instruction index; pass an explicit parser_mode into judge_and_parse_prefs.
  • Dataset/download maintenance (utils.py)
    • download_hf points at judge-arena/judge-arena-dataset; download_all is driven by M_ARENA_HARD_BASELINES.
  • Tests: metadata extraction, async/batch finish_reason propagation, vLLM init retry behavior, expanded make_model kwarg stripping, and baseline-plan resolution cases.
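The retry behavior described for `_init_llm_with_retry` can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `utils.py`: the helper name `init_with_retry`, its parameters, the exponential backoff schedule, and the substring-matching rule are all hypothetical. Only the `cudaErrorDevicesUnavailable` signature comes from the description above.

```python
import time

# Assumption: transient failures are recognized by substring match on the
# error message. Only cudaErrorDevicesUnavailable is named in this PR.
TRANSIENT_SIGNATURES = ("cudaErrorDevicesUnavailable",)


def init_with_retry(factory, max_attempts=3, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call factory(), retrying only on errors that look transient.

    Hypothetical helper: the delay doubles after each failed attempt and is
    capped at max_delay, so the total wait is bounded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return factory()
        except Exception as exc:
            transient = any(sig in str(exc) for sig in TRANSIENT_SIGNATURES)
            # Non-transient errors, and the final failed attempt, propagate
            # unchanged to the caller.
            if not transient or attempt == max_attempts:
                raise
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

In this sketch, `factory` would be a zero-argument callable that constructs the vLLM engine, for example a `lambda` closing over the engine kwargs. Injecting `sleep` keeps the backoff testable without real waiting.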

…up, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.
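
As an illustration of the restored `--model` alias, the sketch below shows one way a deprecated flag can be collapsed into `--model_A` with an error when both are supplied. It is an assumption-laden example, not the contents of `cli.py`: the functions `resolve_model_a`, `build_parser`, and `parse` are hypothetical stand-ins, and the actual signature of `_resolve_model_a` is not shown in this PR.

```python
import argparse


def resolve_model_a(model_a, model):
    """Collapse the deprecated --model value into --model_A.

    Hypothetical stand-in for _resolve_model_a; raises if both flags are set.
    """
    if model_a is not None and model is not None:
        raise ValueError("Pass either --model_A or the deprecated --model, not both.")
    return model_a if model_a is not None else model


def build_parser():
    # Assumption: both flags are plain optional strings with no default.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_A", default=None)
    parser.add_argument("--model", default=None,
                        help="Deprecated alias for --model_A.")
    return parser


def parse(argv):
    args = build_parser().parse_args(argv)
    # After this point, downstream code only needs to read args.model_A.
    args.model_A = resolve_model_a(args.model_A, args.model)
    return args
```

Calling `parse(["--model", "some-model"])` yields the same `model_A` value as `parse(["--model_A", "some-model"])`, while passing both flags raises a `ValueError`.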
@ErlisLushtaku changed the title from "Add runtime robustness, judge-kwarg defaults, CLI/orchestration clean…" to "3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf" on Jun 1, 2026
@ErlisLushtaku requested a review from kargibora on June 1, 2026 14:00
@ErlisLushtaku deleted the pr32-split-v2/02-prompt-presets-localized branch on June 1, 2026 14:26
@ErlisLushtaku removed the request for review from kargibora on June 1, 2026 14:29