4. Thinking model support#53
Conversation
kargibora
left a comment
There was a problem hiding this comment.
I think the PR is pretty self-contained, however I think the implementation of enabling/parsing thinking tokens over-complicated.
I have several questions/concerns:
- We are introducing a lot of arguments like
thinking_token_budged. I did not really understand the importance of this, as we can not disable the thinking for many models (for some models you can setenable_thinking=False) Best things we can do is to prompt model not to "think" - Maybe we can make the completions and judge evaluations as a class on its own and intoduce these filtering, stripping steps there. Thus we can change the mode whenever we want and all of these utilities, parsings becomes self-contained. Or similarly, we can define some
Transformerclass that handles these pipelines such asParser,Filter,Truncater... We can define the pipeline once and then callprefilter(x)orpostfilter(y)for completions / judge evaluations. - Now we have more arguments, we should probably switch to
Configstyle in which we seperate the arguments into different configs and builders for examplecfg.judge.enable_thinking,cfg.pipeline_mode... We probably also want to use someYAMLconfigs which is more human readable and easy to over-ride default arguments
| return provider, model_name | ||
|
|
||
|
|
||
| def is_thinking_model(model_name: str) -> bool: |
There was a problem hiding this comment.
I think this is a bit limiting. Instead we can maybe do this:
model_id = "Qwen/Qwen3-8B"
path = hf_hub_download(
repo_id=model_id,
filename="tokenizer_config.json"
)
with open(path, "r", encoding="utf-8") as f:
cfg = json.load(f)
chat_template = cfg.get("chat_template", "")
signals = [
"enable_thinking",
"/think",
"/no_think",
"<think>",
"</think>",
]
print("Possible thinking support:")
for s in signals:
print(f"{s}: {s in chat_template}") # If true, there is a thinking modeThere was a problem hiding this comment.
The template-scan idea is quite nice, but it doesn't fit very well here, we need more than a boolean, the same map yields the specific vLLM reasoning_parser (qwen3 vs olmo3) that gets wired into the engine. The allowlist is intentionally a default: callers can pass reasoning_parser/reasoning_config explicitly for any model outside it, and unknown models get a warning rather than silent behavior.
That said, we could investigate further if we can get the parser from the template or in a different way as a general solution as a follow-up, but I'd keep the explicit map for now since we don't have a clear answer how to do this.
673b5fb to
d7f4232
Compare
|
Summary
--battle_thinking_token_budgetcontrol.<think>...</think>traces before judging / MT-Bench carryover prompts, and teach the parsing helpers/tests to handle reasoning traces safely.Test plan
uv run pytest tests/test_chat_vllm.py tests/test_utils.py tests/test_regexp.py tests/test_generate_and_evaluate.py tests/test_mt_bench_downloads.py