3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #56
Closed
ErlisLushtaku wants to merge 1 commit into
…up, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.
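The kwarg stripping mentioned in the commit message could look roughly like the sketch below. This is not the PR's `make_model` code: the function name `strip_vllm_only_kwargs`, the provider string `"VLLM"`, and the exact key set are assumptions for illustration. The four keys are the ones named in the PR description, which elides the rest of the list.

```python
# Hypothetical sketch; names and provider strings are assumptions, not the PR's code.
# Only the four kwargs named in the PR description are listed; the real list is longer.
VLLM_ONLY_KWARGS = frozenset(
    {
        "max_model_len",
        "gpu_memory_utilization",
        "tensor_parallel_size",
        "kv_cache_dtype",
    }
)


def strip_vllm_only_kwargs(provider: str, kwargs: dict) -> dict:
    """Return a copy of kwargs that is safe to pass to the given provider."""
    if provider == "VLLM":
        # Local vLLM engines understand every key, so keep them all.
        return dict(kwargs)
    # Remote API providers may reject unknown engine options, so drop them.
    return {k: v for k, v in kwargs.items() if k not in VLLM_ONLY_KWARGS}
```

Returning a new dict rather than mutating the input keeps the caller's kwargs reusable when the same configuration is applied to both a local and a remote model.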
Summary
Runtime, CLI, and orchestration follow-ups split out of the closed PR #52. This intentionally excludes truncation/limit tracking, generation cache-key hashing, `skip_judging`, and the MT-Bench-specific changes (now in PR #55).
Changes
(`utils.py`)
- `_init_llm_with_retry`: retry vLLM construction on transient GPU-init signatures (e.g. `cudaErrorDevicesUnavailable`) with bounded backoff; reraise everything else immediately.
- `make_model`: strip all vLLM-only engine kwargs (`max_model_len`, `gpu_memory_utilization`, `tensor_parallel_size`, `kv_cache_dtype`, …) for non-vLLM providers so remote APIs don't reject them.
- `ChatVLLM`: configurable `temperature`/`top_p` with `set_temperature`, exposed tokenizer, and `chat_template_kwargs` plumbing.

(`utils.py`)
- `_extract_ai_message_metadata` + `do_inference(return_metadata=...)` and `ChatVLLM.batch_with_metadata` so `finish_reason`/`stop_reason` are preserved on both the async (`use_tqdm`) and batch paths.
- `JUDGEARENA_JUDGE_MAX_CONCURRENCY` cap on in-flight async invokes (unset = unbounded, prior behavior).

(`utils.py`, `generate_and_evaluate.py`)
- `build_default_judge_model_kwargs` centralizes judge engine kwargs and adds an FP8 KV-cache default for FP8 judges; `_build_judge_engine_kwargs` now delegates to it.

(`cli.py`)
- `--model` alias, collapsed into `--model_A` via `_resolve_model_a` (errors if both are given).

(`generate_and_evaluate.py`, `evaluate.py`)
- Replace `functools.partial` generation wiring with a single `_run_generation` helper; preserve the fluency instruction index; pass an explicit `parser_mode` into `judge_and_parse_prefs`.

(`utils.py`)
- `download_hf` points at `judge-arena/judge-arena-dataset`; `download_all` is driven by `M_ARENA_HARD_BASELINES`.

- `finish_reason` propagation, vLLM init retry behavior, expanded `make_model` kwarg stripping, and baseline-plan resolution cases.
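For reviewers, the retry behavior described for vLLM construction (retry on transient GPU-init signatures with bounded backoff, reraise everything else immediately) can be sketched as follows. This is an illustration, not the PR's `_init_llm_with_retry`: the helper name `init_with_retry`, its parameters, the substring matching on the exception message, and the default delays are all assumptions.

```python
import time

# Assumed signature list; the PR description gives cudaErrorDevicesUnavailable as one example.
TRANSIENT_SIGNATURES = ("cudaErrorDevicesUnavailable",)


def init_with_retry(factory, max_attempts=3, base_delay=1.0, max_delay=8.0, sleep=time.sleep):
    """Call factory(), retrying only on errors that look like transient GPU-init races.

    Hypothetical helper: names, defaults, and the matching rule are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return factory()
        except Exception as exc:
            transient = any(sig in str(exc) for sig in TRANSIENT_SIGNATURES)
            # Non-transient errors, and the final failed attempt, propagate unchanged.
            if not transient or attempt == max_attempts:
                raise
            # Exponential backoff, capped at max_delay so the total wait stays bounded.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

Passing `sleep` as a parameter is a design choice for the sketch: tests can record the delays instead of actually waiting.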