4. Thinking model support by ErlisLushtaku · Pull Request #53 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-05-19T14:00:18Z

Summary

Add thinking-model runtime support for vLLM-based judges and battle models, including built-in reasoning-parser defaults for Qwen3, SmolLM3, and OLMo-Think plus optional --battle_thinking_token_budget control.
Add optional stripping of visible <think>...</think> traces before judging / MT-Bench carryover prompts, and teach the parsing helpers/tests to handle reasoning traces safely.
Update the local vLLM/uv dependency setup needed for the newer reasoning-capable runtime, and add focused coverage for the new ChatVLLM thinking behavior.
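
The optional stripping step described above could look roughly like the following. This is a minimal sketch, not the PR's actual helper: the name `strip_thinking_trace` and the choice to drop an unterminated `<think>` tail are assumptions made for illustration.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)
# Matches an unterminated trace: <think> with no closing tag before end of text.
_OPEN_THINK = re.compile(r"<think>.*\Z", flags=re.DOTALL)


def strip_thinking_trace(text: str) -> str:
    """Remove visible reasoning traces (hypothetical helper, not the PR's code)."""
    without_closed = _THINK_BLOCK.sub("", text)
    # Assumption: a dangling <think> (e.g. from truncated output) is dropped entirely.
    without_open = _OPEN_THINK.sub("", without_closed)
    return without_open.strip()
```

Applied before judging or before an MT-Bench carryover prompt, this would leave only the final answer text for the next stage to see.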

Test plan

uv run pytest tests/test_chat_vllm.py tests/test_utils.py tests/test_regexp.py tests/test_generate_and_evaluate.py tests/test_mt_bench_downloads.py

kargibora

I think the PR is pretty self-contained; however, I think the implementation of enabling/parsing thinking tokens is over-complicated.

I have several questions/concerns:

1. We are introducing a lot of arguments like thinking_token_budget. I did not really understand the importance of this, as we cannot disable thinking for many models (for some models you can set enable_thinking=False). The best we can do is prompt the model not to "think".
2. Maybe we can make the completions and judge evaluations a class of their own and introduce these filtering and stripping steps there. That way we can change the mode whenever we want, and all of these utilities and parsing steps become self-contained. Similarly, we could define a Transformer class that handles pipeline stages such as Parser, Filter, Truncater... We would define the pipeline once and then call prefilter(x) or postfilter(y) for completions / judge evaluations.
3. Now that we have more arguments, we should probably switch to a config style in which we separate the arguments into different configs and builders, for example cfg.judge.enable_thinking, cfg.pipeline_mode... We probably also want to use YAML configs, which are more human-readable and make it easy to override default arguments.
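
To make points 2 and 3 concrete, here is one possible shape for a config-driven pipeline. It is a sketch under stated assumptions: `cfg.judge.enable_thinking`, `cfg.pipeline_mode`, and `prefilter` come from the comment above, while `RunConfig`, `JudgeConfig`, `TextPipeline`, `build_prefilter`, and `max_chars` are hypothetical names that do not exist in the repository.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable

Step = Callable[[str], str]


@dataclass
class JudgeConfig:
    enable_thinking: bool = True
    strip_thinking_before_judging: bool = True


@dataclass
class RunConfig:
    pipeline_mode: str = "default"
    max_chars: int = 2000  # hypothetical truncation limit
    judge: JudgeConfig = field(default_factory=JudgeConfig)


def strip_think(text: str) -> str:
    """Filter step: drop complete <think>...</think> blocks."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)


class TextPipeline:
    """Applies a fixed sequence of text-to-text steps in order."""

    def __init__(self, steps: Iterable[Step]) -> None:
        self.steps = list(steps)

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text


def build_prefilter(cfg: RunConfig) -> TextPipeline:
    """Builder: the config decides which steps the pipeline contains."""
    steps = []
    if cfg.judge.strip_thinking_before_judging:
        steps.append(strip_think)
    steps.append(lambda t: t[: cfg.max_chars])  # Truncater step
    return TextPipeline(steps)
```

With this shape, `prefilter = build_prefilter(cfg)` is defined once and `prefilter(x)` is called per completion. A YAML file could populate the dataclasses, but the loading code is left out here because the thread does not specify a format.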

kargibora · 2026-05-26T10:56:08Z

    return provider, model_name


+def is_thinking_model(model_name: str) -> bool:


I think this is a bit limiting. Instead we can maybe do this:

    import json

    from huggingface_hub import hf_hub_download

    model_id = "Qwen/Qwen3-8B"
    # Fetch only the tokenizer config, which usually carries the chat template.
    path = hf_hub_download(repo_id=model_id, filename="tokenizer_config.json")
    with open(path, "r", encoding="utf-8") as f:
        cfg = json.load(f)

    chat_template = cfg.get("chat_template", "")
    signals = ["enable_thinking", "/think", "/no_think", "<think>", "</think>"]

    print("Possible thinking support:")
    for s in signals:
        # If true, there is a thinking mode
        print(f"{s}: {s in chat_template}")

ErlisLushtaku · 2026-06-02

The template-scan idea is quite nice, but it doesn't fit very well here: we need more than a boolean. The same map yields the specific vLLM reasoning_parser (qwen3 vs. olmo3) that gets wired into the engine. The allowlist is intentionally a default: callers can pass reasoning_parser/reasoning_config explicitly for any model outside it, and unknown models get a warning rather than silent behavior.

That said, as a follow-up we could investigate whether the parser can be derived from the template, or in some other general way, but I'd keep the explicit map for now since we don't have a clear answer on how to do this.
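
The behavior described in this reply (a default map, an explicit override, and a warning for unknown models) could be sketched as follows. This is illustrative only: `resolve_reasoning_parser` and the substring keys are assumptions, and only the two parser names mentioned in the thread are listed.

```python
import warnings
from typing import Optional

# Illustrative allowlist: model-name substring -> vLLM reasoning_parser name.
# The substrings used as keys are assumptions made for this example.
_DEFAULT_REASONING_PARSERS = {
    "qwen3": "qwen3",
    "olmo": "olmo3",
}


def resolve_reasoning_parser(
    model_name: str, reasoning_parser: Optional[str] = None
) -> Optional[str]:
    """Return the parser to wire into the engine (hypothetical helper)."""
    # An explicit argument always wins, so models outside the map still work.
    if reasoning_parser is not None:
        return reasoning_parser
    lowered = model_name.lower()
    for needle, parser in _DEFAULT_REASONING_PARSERS.items():
        if needle in lowered:
            return parser
    # Unknown model: warn instead of silently picking a behavior.
    warnings.warn(
        f"No default reasoning parser for {model_name!r}; "
        "pass reasoning_parser explicitly if this is a thinking model."
    )
    return None
```

Returning the parser name rather than a boolean is the point made above: the same lookup answers both "is this a thinking model?" and "which parser should be used?".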

ErlisLushtaku · 2026-06-02T13:15:31Z

I think the PR is pretty self-contained; however, I think the implementation of enabling/parsing thinking tokens is over-complicated.

I have several questions/concerns:

1. We are introducing a lot of arguments like thinking_token_budget. I did not really understand the importance of this, as we cannot disable thinking for many models (for some models you can set enable_thinking=False). The best we can do is prompt the model not to "think".

2. Maybe we can make the completions and judge evaluations a class of their own and introduce these filtering and stripping steps there. That way we can change the mode whenever we want, and all of these utilities and parsing steps become self-contained. Similarly, we could define a Transformer class that handles pipeline stages such as Parser, Filter, Truncater... We would define the pipeline once and then call prefilter(x) or postfilter(y) for completions / judge evaluations.

3. Now that we have more arguments, we should probably switch to a config style in which we separate the arguments into different configs and builders, for example cfg.judge.enable_thinking, cfg.pipeline_mode... We probably also want to use YAML configs, which are more human-readable and make it easy to override default arguments.

1. The surface is just two flags (--battle_thinking_token_budget, --strip_thinking_before_judging); the rest are internal engine kwargs. On the budget specifically: at temp=0 we've repeatedly seen reasoning models (Qwen3.5/SmolLM3) emit very long or looping <think> traces that blow up context and cost (and also fail to give a verdict when used as a judge), so a hard token cap is a deterministic guard that prompt-only "don't think" instructions don't reliably provide. Where a model supports it we can set enable_thinking=False; however, that would diminish the potential gains of the thinking models we wanted to be able to use.
2 and 3. Agree both are worth doing, but they're repo-wide architecture changes beyond this PR. We could open separate issues for both so we can design them properly without blocking thinking-model support.
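
The argument for a hard cap in point 1 can be illustrated with a toy decoding loop. This is a conceptual sketch only: it is not how vLLM enforces a budget, and `generate_with_thinking_budget`, `toy_looping_model`, and the one-string-per-token representation are all assumptions.

```python
from typing import Callable, List


def toy_looping_model(context: List[str]) -> str:
    """Toy stand-in for a model that never closes its own <think> block."""
    if not context:
        return "<think>"
    if "</think>" not in context:
        return "hmm"  # loops forever unless something forces the close tag
    if context[-1] == "</think>":
        return "Verdict: A"
    return "<eos>"


def generate_with_thinking_budget(
    next_token: Callable[[List[str]], str], budget: int, max_tokens: int = 64
) -> List[str]:
    out: List[str] = []
    inside_think = False
    used = 0  # tokens emitted inside the current <think> block
    while len(out) < max_tokens:
        if inside_think and used >= budget:
            tok = "</think>"  # budget exhausted: force the trace to end
        else:
            tok = next_token(out)
        if tok == "<eos>":
            break
        out.append(tok)
        if tok == "<think>":
            inside_think, used = True, 0
        elif tok == "</think>":
            inside_think = False
        elif inside_think:
            used += 1
    return out
```

With `budget=3` the toy model is forced out of its loop and reaches the verdict; with a budget larger than `max_tokens` the output is all reasoning and no verdict, which is the failure mode described above.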

ErlisLushtaku changed the title ~~Thinking model support~~ 4. Thinking model support May 19, 2026

ErlisLushtaku requested a review from kargibora May 25, 2026 22:13

kargibora reviewed May 26, 2026

Thinking model support (linearized net diff)

d7f4232

ErlisLushtaku force-pushed the pr32-split-v3/04-thinking-model-support branch from 673b5fb to d7f4232 Compare June 1, 2026 22:03

ErlisLushtaku changed the base branch from pr32-split-v3/03-metadata-truncation-tracking to pr32-split-v2/03.5-runtime-cli-followups June 1, 2026 22:04

ErlisLushtaku wants to merge 1 commit into pr32-split-v2/03.5-runtime-cli-followups from pr32-split-v3/04-thinking-model-support
