Skip to content

4. Thinking model support#53

Open
ErlisLushtaku wants to merge 1 commit into
pr32-split-v2/03.5-runtime-cli-followupsfrom
pr32-split-v3/04-thinking-model-support
Open

4. Thinking model support#53
ErlisLushtaku wants to merge 1 commit into
pr32-split-v2/03.5-runtime-cli-followupsfrom
pr32-split-v3/04-thinking-model-support

Conversation

@ErlisLushtaku

@ErlisLushtaku ErlisLushtaku commented May 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add thinking-model runtime support for vLLM-based judges and battle models, including built-in reasoning-parser defaults for Qwen3, SmolLM3, and OLMo-Think plus optional --battle_thinking_token_budget control.
  • Add optional stripping of visible <think>...</think> traces before judging / MT-Bench carryover prompts, and teach the parsing helpers/tests to handle reasoning traces safely.
  • Update the local vLLM/uv dependency setup needed for the newer reasoning-capable runtime, and add focused coverage for the new ChatVLLM thinking behavior.

Test plan

  • uv run pytest tests/test_chat_vllm.py tests/test_utils.py tests/test_regexp.py tests/test_generate_and_evaluate.py tests/test_mt_bench_downloads.py

@ErlisLushtaku ErlisLushtaku changed the title Thinking model support 4. Thinking model support May 19, 2026
@ErlisLushtaku ErlisLushtaku requested a review from kargibora May 25, 2026 22:13

@kargibora kargibora left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the PR is pretty self-contained, however I think the implementation of enabling/parsing thinking tokens over-complicated.

I have several questions/concerns:

  1. We are introducing a lot of arguments like thinking_token_budged. I did not really understand the importance of this, as we can not disable the thinking for many models (for some models you can set enable_thinking=False) Best things we can do is to prompt model not to "think"
  2. Maybe we can make the completions and judge evaluations as a class on its own and intoduce these filtering, stripping steps there. Thus we can change the mode whenever we want and all of these utilities, parsings becomes self-contained. Or similarly, we can define some Transformer class that handles these pipelines such as Parser, Filter, Truncater... We can define the pipeline once and then call prefilter(x) or postfilter(y) for completions / judge evaluations.
  3. Now we have more arguments, we should probably switch to Config style in which we seperate the arguments into different configs and builders for example cfg.judge.enable_thinking, cfg.pipeline_mode... We probably also want to use some YAML configs which is more human readable and easy to over-ride default arguments

Comment thread judgearena/utils.py
return provider, model_name


def is_thinking_model(model_name: str) -> bool:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit limiting. Instead we can maybe do this:

model_id = "Qwen/Qwen3-8B"

path = hf_hub_download(
    repo_id=model_id,
    filename="tokenizer_config.json"
)

with open(path, "r", encoding="utf-8") as f:
    cfg = json.load(f)

chat_template = cfg.get("chat_template", "")

signals = [
    "enable_thinking",
    "/think",
    "/no_think",
    "<think>",
    "</think>",
]

print("Possible thinking support:")
for s in signals:
    print(f"{s}: {s in chat_template}") # If true, there is a thinking mode

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The template-scan idea is quite nice, but it doesn't fit very well here, we need more than a boolean, the same map yields the specific vLLM reasoning_parser (qwen3 vs olmo3) that gets wired into the engine. The allowlist is intentionally a default: callers can pass reasoning_parser/reasoning_config explicitly for any model outside it, and unknown models get a warning rather than silent behavior.

That said, we could investigate further if we can get the parser from the template or in a different way as a general solution as a follow-up, but I'd keep the explicit map for now since we don't have a clear answer how to do this.

Comment thread judgearena/mt_bench/fastchat_compat.py Outdated
@ErlisLushtaku ErlisLushtaku force-pushed the pr32-split-v3/04-thinking-model-support branch from 673b5fb to d7f4232 Compare June 1, 2026 22:03
@ErlisLushtaku ErlisLushtaku changed the base branch from pr32-split-v3/03-metadata-truncation-tracking to pr32-split-v2/03.5-runtime-cli-followups June 1, 2026 22:04
@ErlisLushtaku

Copy link
Copy Markdown
Collaborator Author

I think the PR is pretty self-contained, however I think the implementation of enabling/parsing thinking tokens over-complicated.

I have several questions/concerns:

  1. We are introducing a lot of arguments like thinking_token_budged. I did not really understand the importance of this, as we can not disable the thinking for many models (for some models you can set enable_thinking=False) Best things we can do is to prompt model not to "think"
  2. Maybe we can make the completions and judge evaluations as a class on its own and intoduce these filtering, stripping steps there. Thus we can change the mode whenever we want and all of these utilities, parsings becomes self-contained. Or similarly, we can define some Transformer class that handles these pipelines such as Parser, Filter, Truncater... We can define the pipeline once and then call prefilter(x) or postfilter(y) for completions / judge evaluations.
  3. Now we have more arguments, we should probably switch to Config style in which we seperate the arguments into different configs and builders for example cfg.judge.enable_thinking, cfg.pipeline_mode... We probably also want to use some YAML configs which is more human readable and easy to over-ride default arguments
  1. The surface is just two flags (--battle_thinking_token_budget, --strip_thinking_before_judging); the rest are internal engine kwargs. On the budget specifically: at temp=0 we've repeatedly seen reasoning models (Qwen3.5/SmolLM3) emit very long or looping <think> traces that blow context and cost (and also don't give verdict if used as a judge), so a hard token cap is a deterministic guard that prompt-only "don't think" instructions don't reliably provide. Where a model supports it we can set enable_thinking=False however that would diminish the potential gains of thinking models which we wanted to be able to use.
    2 and 3. Agree both are worth doing, but they're repo-wide architecture changes beyond this PR. We could open separate issues for both so we can design them properly without blocking thinking-model support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants