use MTP if model_dir and draft_model_dir are equal #424
suspicious-pineapple wants to merge 1 commit into
This seems to be the minimal set of changes needed to make MTP work. Tested with <https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3>.
I can't get this to work yet. I'm on exllamav3 9c5009efaa2cda8ed341369123bb4acfe18ae300. AI-generated report below:

Bug Report: AttributeError during MTP Draft Model Generation

Description
When initiating a chat completion with Multi-Token Prediction (MTP) enabled via the ExLlamaV3 backend, the generation process crashes. The error indicates that a linear module's inner component (

Steps to Reproduce

Expected Behavior
The model should successfully iterate through draft tokens using MTP and stream the completion without crashing.

Actual Behavior
The server raises an

Error Log & Traceback Analysis
Critical Error:
Call Stack Highlights:

Potential Causes

Environment

Note: This bug report was drafted with the assistance of AI based on the provided traceback log.
@randoentity
model:
  model_dir: models
  inline_model_loading: true
  use_dummy_models: true
  dummy_model_names: ["local"]
  model_name: UnstableLlama_Qwen3.6-27B-exl3-8.00bpw
  backend: exllamav3
  max_seq_len: 131072
  tensor_parallel: true
  tensor_parallel_backend: native
  cache_mode: 8,8
  cache_size: 131072
  max_batch_size: 1
  vision: true
  reasoning: true
  reasoning_start_token: "<think>"
  reasoning_end_token: "</think>"
  tool_format: qwen3_coder
draft_model:
  draft_model_dir: models
  draft_model_name: UnstableLlama_Qwen3.6-27B-exl3-8.0bpw # (or nothing)

I tried a bunch of draft model variations and locations. When I try without TP I get:
There may still be issues with TP, but I've added MTP support now. The illegal memory access happens when the MTP model and target model end up on different devices, but that's sorted in exllamav3==0.0.42 (released just now, and Tabby is updated). To enable, set
Why should this feature be added?
This seems to be the minimal set of changes needed to make MTP work on the latest exl3 dev branch.
Examples
MTP is enabled if the main model is the same as the draft model; otherwise it behaves normally.
Maybe this would be more sanely exposed as a config option?
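To illustrate the rule described above, here is a minimal sketch of the check. It is not the code from this PR: the function name `should_use_mtp` and the path-normalizing comparison are assumptions made for illustration only. The sketch treats the draft model as "the same" when both paths resolve to the same location, and as absent when no draft path is given.

```python
from pathlib import Path
from typing import Optional


def should_use_mtp(model_path: str, draft_model_path: Optional[str]) -> bool:
    """Hypothetical helper: enable MTP only when the draft model path
    points at the main model itself."""
    # No draft model configured: nothing to compare, so MTP stays off.
    if not draft_model_path:
        return False
    # Resolve both paths so that "models/a" and "models/b/../a" compare equal.
    return Path(model_path).resolve() == Path(draft_model_path).resolve()
```

With a check like this, pointing the draft model at a different directory keeps ordinary draft-model behavior, which matches the "otherwise it behaves normally" description. An explicit config option, as suggested above, would replace the path comparison with a plain boolean.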
Additional context
Tested with https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 (you have to download the safetensors file and put it in the model dir; I assume it will be included by default in future quants where supported).