3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #57
…up, point to judgearena hf
- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.
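The kwarg stripping mentioned in the commit message can be pictured with a minimal sketch. It is illustrative only: `VLLM_ONLY_KWARGS`, `strip_engine_kwargs`, and the `"vllm"` provider string are hypothetical names, and the key set lists only the four kwargs this PR names explicitly, so the real `make_model` may filter more.

```python
# Hypothetical sketch; not the actual make_model implementation.
# Only the four kwargs named in this PR are listed; the real set may be larger.
VLLM_ONLY_KWARGS = frozenset(
    {
        "max_model_len",
        "gpu_memory_utilization",
        "tensor_parallel_size",
        "kv_cache_dtype",
    }
)


def strip_engine_kwargs(provider: str, kwargs: dict) -> dict:
    """Keep kwargs as-is for vLLM; drop vLLM-only keys for any other provider."""
    if provider.lower() == "vllm":  # assumed provider identifier
        return dict(kwargs)
    return {key: value for key, value in kwargs.items() if key not in VLLM_ONLY_KWARGS}
```

Filtering against an explicit set, rather than passing everything through, means a remote API never sees an engine option it could reject.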
| "--model_B", | ||
| help="Model B for generate+judge tasks (not yet supported for elo tasks).", | ||
| ) | ||
| parser.add_argument( |
I am just skimming through this PR and I was wondering about this.
I see the point of having another flag, but isn't it better in this case? This is not really model_A, since the positions are randomized, and the name may be confusing to users.
| return f"{ELO_TASK_PREFIX}{lower_arena}" | ||
| def _resolve_model_a(args: argparse.Namespace) -> str | None: |
Creating an alias is a bit confusing to me. We should always be consistent; even if that breaks some existing code, there is no need to bring the old flag back. More arguments (although they add configurability) make the code harder to work with.
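For context, the alias handling under discussion can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's `_resolve_model_a`: `build_parser` and `resolve_model_a` are hypothetical names, and the error type is a guess. The description only says that passing both flags is an error.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing --model_A plus the deprecated --model alias."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_A", default=None)
    parser.add_argument("--model", default=None, help="Deprecated alias for --model_A.")
    return parser


def resolve_model_a(args: argparse.Namespace) -> str | None:
    """Collapse the two flags into one value; reject ambiguous input."""
    if args.model is not None and args.model_A is not None:
        raise ValueError("Pass either --model_A or the deprecated --model, not both.")
    return args.model_A if args.model_A is not None else args.model
```

Dropping the alias, as suggested above, would reduce this to reading `args.model_A` directly.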
| @@ -191,9 +195,11 @@ def _resolve_baseline_plan( | |||
| def _build_judge_engine_kwargs(args: CliArgs) -> dict[str, object]: | |||
I know this is a mirror function but do we really need it?
| for chunk in chunks: | ||
| results.extend(chat_model.batch(inputs=chunk, **invoke_kwargs)) | ||
| return results | ||
| if return_metadata and hasattr( |
We never actually pass return_metadata=True outside the tests. Can you check this behaviour?
Also, it would be nice to have docstrings for non-local functions.
| "CUDA error: initialization error", | ||
| ) | ||
| _VLLM_INIT_MAX_ATTEMPTS = int(os.getenv("JUDGEARENA_VLLM_INIT_MAX_ATTEMPTS", "4")) | ||
| _VLLM_INIT_BACKOFF_SECONDS = int( |
Possibly, if we set JUDGEARENA_VLLM_INIT_BACKOFF_SECONDS=" ", the entire CLI crashes, since this is evaluated at import time, before any functionality runs. Top-level code should never do this type of conversion without validating the input. If the input is malformed, it is better to fall back to the default or to fail with a proper error log.
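One way to address this is a small validating helper. The sketch below is only a suggestion: `env_int` is a hypothetical name, and falling back to the default (rather than raising) is one of the two options mentioned above.

```python
import logging
import os

logger = logging.getLogger(__name__)


def env_int(name: str, default: int, minimum: int = 0) -> int:
    """Read an integer environment variable, falling back to default on bad input."""
    raw = os.getenv(name)
    if raw is None or not raw.strip():
        # Unset or blank (e.g. " "): treat as "not configured".
        return default
    try:
        value = int(raw)
    except ValueError:
        logger.warning("Ignoring malformed %s=%r; using default %d.", name, raw, default)
        return default
    if value < minimum:
        logger.warning("Ignoring out-of-range %s=%d; using default %d.", name, value, default)
        return default
    return value
```

A module-level constant could then be written as `_VLLM_INIT_MAX_ATTEMPTS = env_int("JUDGEARENA_VLLM_INIT_MAX_ATTEMPTS", 4, minimum=1)`, so a malformed value no longer breaks import.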
| # JUDGEARENA_JUDGE_MAX_CONCURRENCY caps simultaneous in-flight ainvokes | ||
| # (e.g. against OpenRouter). Unset = unbounded, preserving prior behaviour. | ||
| cap_raw = os.environ.get("JUDGEARENA_JUDGE_MAX_CONCURRENCY") | ||
| cap = int(cap_raw) if cap_raw and int(cap_raw) > 0 else None |
Similar to the problem above: a ValueError from int(cap_raw) is uncaught.
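A defensive version of the cap parsing, together with the semaphore pattern such a cap usually feeds, might look like the sketch below. `parse_concurrency_cap` and `gather_bounded` are hypothetical names. Treating a malformed value as "unbounded" is an assumption chosen to match the documented unset behaviour.

```python
from __future__ import annotations

import asyncio
import logging
from collections.abc import Awaitable, Callable, Sequence

logger = logging.getLogger(__name__)


def parse_concurrency_cap(raw: str | None) -> int | None:
    """Return a positive cap, or None (unbounded) for unset, blank, bad, or <= 0 input."""
    if raw is None or not raw.strip():
        return None
    try:
        cap = int(raw)
    except ValueError:
        logger.warning("Ignoring malformed concurrency cap %r; running unbounded.", raw)
        return None
    return cap if cap > 0 else None


async def gather_bounded(
    factories: Sequence[Callable[[], Awaitable[object]]], cap: int | None
) -> list[object]:
    """Await every factory's coroutine, with at most cap running at once."""
    if cap is None:
        return list(await asyncio.gather(*(factory() for factory in factories)))
    semaphore = asyncio.Semaphore(cap)

    async def guarded(factory: Callable[[], Awaitable[object]]) -> object:
        async with semaphore:
            return await factory()

    return list(await asyncio.gather(*(guarded(factory) for factory in factories)))
```

Results come back in input order because `asyncio.gather` preserves the order of its arguments.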
Summary
Runtime, CLI, and orchestration follow-ups split out of the closed PR #52. This intentionally excludes truncation/limit tracking, generation cache-key hashing, skip_judging, and the MT-Bench-specific changes (now in PR #55).

Changes
(utils.py)
- _init_llm_with_retry: retry vLLM construction on transient GPU-init signatures (e.g. cudaErrorDevicesUnavailable) with bounded backoff; re-raise everything else immediately.
- make_model: strip all vLLM-only engine kwargs (max_model_len, gpu_memory_utilization, tensor_parallel_size, kv_cache_dtype, …) for non-vLLM providers so remote APIs don't reject them.
- ChatVLLM: configurable temperature/top_p with set_temperature, exposed tokenizer, and chat_template_kwargs plumbing.

(utils.py)
- _extract_ai_message_metadata + do_inference(return_metadata=...) and ChatVLLM.batch_with_metadata so finish_reason/stop_reason are preserved on both the async (use_tqdm) and batch paths.
- JUDGEARENA_JUDGE_MAX_CONCURRENCY cap on in-flight async invokes (unset = unbounded, prior behavior).

(utils.py, generate_and_evaluate.py)
- build_default_judge_model_kwargs centralizes judge engine kwargs and adds an FP8 KV-cache default for FP8 judges; _build_judge_engine_kwargs now delegates to it.

(cli.py)
- Restore the deprecated --model alias, collapsed into --model_A via _resolve_model_a (errors if both are given).

(generate_and_evaluate.py, evaluate.py)
- Replace functools.partial generation wiring with a single _run_generation helper; preserve the fluency instruction index; pass an explicit parser_mode into judge_and_parse_prefs.

(utils.py)
- download_hf points at judge-arena/judge-arena-dataset; download_all is driven by M_ARENA_HARD_BASELINES.

- finish_reason propagation, vLLM init retry behavior, expanded make_model kwarg stripping, and baseline-plan resolution cases.

Supersedes #56 (closed): branch renamed 02-prompt-presets-localized → 03.5-runtime-cli-followups and rebased onto the PR #55 tip so it forms a clean linear stack. Same single-commit diff (6 runtime files).
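As a rough illustration of the retry behaviour described for _init_llm_with_retry, here is a minimal sketch. It is not the PR's code: `init_with_retry`, the linear backoff schedule, and the `backoff_seconds`/`max_backoff_seconds` defaults are assumptions. Only the two error signatures and the default of 4 attempts appear in this PR.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")

# The two signatures quoted in this PR; the real list may contain more.
TRANSIENT_SIGNATURES = (
    "cudaErrorDevicesUnavailable",
    "CUDA error: initialization error",
)


def init_with_retry(
    factory: Callable[[], T],
    max_attempts: int = 4,
    backoff_seconds: float = 5.0,  # assumed default
    max_backoff_seconds: float = 30.0,  # assumed bound
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call factory, retrying only on errors that match a transient signature."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return factory()
        except Exception as exc:
            transient = any(sig in str(exc) for sig in TRANSIENT_SIGNATURES)
            if not transient or attempt == max_attempts:
                raise  # non-transient errors and the final failure propagate
            # Linear backoff, capped so the total wait stays bounded.
            sleep(min(backoff_seconds * attempt, max_backoff_seconds))
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the helper testable without real delays.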