3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf#57

Open
ErlisLushtaku wants to merge 1 commit into pr32-split-v2/02.5-mt-bench-preset-judging from pr32-split-v2/03.5-runtime-cli-followups


Conversation

@ErlisLushtaku

Collaborator

Summary

Runtime, CLI, and orchestration follow-ups split out of the closed PR #52. This intentionally excludes truncation/limit tracking, generation cache-key hashing, skip_judging, and the MT-Bench-specific changes (now in PR #55).

Changes

  • Runtime / model robustness (utils.py)
    • _init_llm_with_retry: retry vLLM construction on transient GPU-init signatures (e.g. cudaErrorDevicesUnavailable) with bounded backoff; reraise everything else immediately.
    • make_model: strip all vLLM-only engine kwargs (max_model_len, gpu_memory_utilization, tensor_parallel_size, kv_cache_dtype, …) for non-vLLM providers so remote APIs don't reject them.
    • ChatVLLM: configurable temperature/top_p with set_temperature, exposed tokenizer, and chat_template_kwargs plumbing.
  • Metadata propagation bugfix (utils.py)
    • _extract_ai_message_metadata + do_inference(return_metadata=...) and ChatVLLM.batch_with_metadata so finish_reason/stop_reason are preserved on both the async (use_tqdm) and batch paths.
    • Optional JUDGEARENA_JUDGE_MAX_CONCURRENCY cap on in-flight async invokes (unset = unbounded, prior behavior).
  • Judge engine defaults (utils.py, generate_and_evaluate.py)
    • build_default_judge_model_kwargs centralizes judge engine kwargs and adds an FP8 KV-cache default for FP8 judges; _build_judge_engine_kwargs now delegates to it.
  • CLI compatibility (cli.py)
    • Restore the deprecated --model alias, collapsed into --model_A via _resolve_model_a (errors if both are given).
  • Generation / evaluation orchestration (generate_and_evaluate.py, evaluate.py)
    • Replace functools.partial generation wiring with a single _run_generation helper; preserve the fluency instruction index; pass an explicit parser_mode into judge_and_parse_prefs.
  • Dataset/download maintenance (utils.py)
    • download_hf points at judge-arena/judge-arena-dataset; download_all is driven by M_ARENA_HARD_BASELINES.
  • Tests: metadata extraction, async/batch finish_reason propagation, vLLM init retry behavior, expanded make_model kwarg stripping, and baseline-plan resolution cases.
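The retry behavior described for _init_llm_with_retry can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `init_with_retry`, its parameters, and the exponential schedule are assumptions, and only the two error signatures quoted on this page are used.

```python
import time

# Substrings that mark a transient GPU-init failure. Both appear on this page;
# the PR's actual list may contain more entries.
TRANSIENT_SIGNATURES = (
    "cudaErrorDevicesUnavailable",
    "CUDA error: initialization error",
)


def init_with_retry(
    factory,
    max_attempts=4,
    backoff_seconds=1.0,
    max_backoff_seconds=30.0,
    sleep=time.sleep,
):
    """Call factory(), retrying only on errors that look transient.

    Hypothetical helper: the name and signature are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return factory()
        except Exception as exc:
            transient = any(sig in str(exc) for sig in TRANSIENT_SIGNATURES)
            # Non-transient errors, and the final failed attempt, are reraised.
            if not transient or attempt == max_attempts:
                raise
            # Exponential backoff, capped so the total wait stays bounded.
            sleep(min(backoff_seconds * 2 ** (attempt - 1), max_backoff_seconds))
```

Passing `sleep` as a parameter is a design choice for the sketch only: it lets tests substitute a recorder instead of actually waiting.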

Supersedes #56 (closed): branch renamed from 02-prompt-presets-localized to 03.5-runtime-cli-followups and rebased onto the PR #55 tip so it forms a clean linear stack. Same single-commit diff (6 runtime files).

…up, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.
@ErlisLushtaku ErlisLushtaku requested a review from kargibora June 1, 2026 14:29
Comment thread judgearena/cli.py
"--model_B",
help="Model B for generate+judge tasks (not yet supported for elo tasks).",
)
parser.add_argument(

Collaborator

I am just skimming through this PR and I was wondering about this.
I see the point of having another flag, but isn't it better in this case, since this is not really model_A (the positions are randomized) and the name may be confusing to users?

Comment thread judgearena/cli.py
return f"{ELO_TASK_PREFIX}{lower_arena}"


def _resolve_model_a(args: argparse.Namespace) -> str | None:

Collaborator

Creating an alias is a bit confusing to me. We should always be consistent; even if the rename breaks some existing code, there is no need to bring the old flag back. More arguments (although they add configurability) result in code that is harder to maintain.

@@ -191,9 +195,11 @@ def _resolve_baseline_plan(


def _build_judge_engine_kwargs(args: CliArgs) -> dict[str, object]:

Collaborator

I know this is a mirror function, but do we really need it?

Comment thread judgearena/utils.py
for chunk in chunks:
results.extend(chat_model.batch(inputs=chunk, **invoke_kwargs))
return results
if return_metadata and hasattr(

Collaborator

We never actually pass return_metadata=True in production code; it is only passed in the tests. Can you check this behaviour?

Also, it would be nice to have docstrings for the non-local functions.

Comment thread judgearena/utils.py
"CUDA error: initialization error",
)
_VLLM_INIT_MAX_ATTEMPTS = int(os.getenv("JUDGEARENA_VLLM_INIT_MAX_ATTEMPTS", "4"))
_VLLM_INIT_BACKOFF_SECONDS = int(

Collaborator

If we set JUDGEARENA_VLLM_INIT_BACKOFF_SECONDS=" ", the entire CLI possibly crashes, because this conversion runs at import time, before any functionality is loaded. Top-level code should never do this type of conversion without validating the input. (If the input is malformed, it is better to fall back to the default or to crash with a proper error log.)
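One way to address this is a small parsing helper that falls back to the default and logs a warning when the value is malformed. The sketch below is only a suggestion: `env_int` is a hypothetical name, not a function in the PR.

```python
import logging
import os

logger = logging.getLogger(__name__)


def env_int(name: str, default: int, minimum: int = 0) -> int:
    """Read an integer environment variable without raising at import time.

    Hypothetical helper: returns `default` when the variable is unset,
    malformed (e.g. " " or "abc"), or below `minimum`.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        logger.warning("Ignoring malformed %s=%r; using default %d", name, raw, default)
        return default
    if value < minimum:
        logger.warning("Ignoring %s=%d (< %d); using default %d", name, value, minimum, default)
        return default
    return value


# Usage with the variable and default shown in the diff above.
_VLLM_INIT_MAX_ATTEMPTS = env_int("JUDGEARENA_VLLM_INIT_MAX_ATTEMPTS", 4, minimum=1)
```

Crashing with an explicit error message instead of falling back would be an equally valid choice; the point is that the failure mode is deliberate.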

Comment thread judgearena/utils.py
# JUDGEARENA_JUDGE_MAX_CONCURRENCY caps simultaneous in-flight ainvokes
# (e.g. against OpenRouter). Unset = unbounded, preserving prior behaviour.
cap_raw = os.environ.get("JUDGEARENA_JUDGE_MAX_CONCURRENCY")
cap = int(cap_raw) if cap_raw and int(cap_raw) > 0 else None

Collaborator

Similar to the problem above: int(cap_raw) can raise an uncaught ValueError.
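A guarded version of the cap parsing, together with one common way to enforce such a cap on async calls, might look like this. Both `parse_concurrency_cap` and `gather_capped` are hypothetical names, and the semaphore approach is an assumption about how the cap could be applied, not a description of the PR's implementation.

```python
import asyncio
import os


def parse_concurrency_cap(raw):
    """Return a positive int cap, or None (unbounded) when the value is
    unset, malformed, or non-positive."""
    if raw is None:
        return None
    try:
        cap = int(raw)
    except ValueError:
        return None
    return cap if cap > 0 else None


async def gather_capped(coro_factories, cap):
    """Run zero-argument coroutine factories with at most `cap` in flight.

    cap=None preserves unbounded behaviour.
    """
    semaphore = asyncio.Semaphore(cap) if cap else None

    async def run(factory):
        if semaphore is None:
            return await factory()
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run(f) for f in coro_factories))


cap = parse_concurrency_cap(os.environ.get("JUDGEARENA_JUDGE_MAX_CONCURRENCY"))
```

Treating a malformed value as "unbounded" keeps the prior default; logging a warning in that branch would make the fallback visible.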

3 participants