
use MTP if model_dir and draft_model_dir are equal #424

Open
suspicious-pineapple wants to merge 1 commit into theroyallab:main from suspicious-pineapple:patch-2


@suspicious-pineapple


Why should this feature be added?
This seems to be the minimal set of changes needed to make MTP work on the latest exl3 dev branch.

Examples
MTP is enabled if the main model is the same as the draft model; otherwise it behaves normally.
Maybe this would more sanely be exposed as a config option?
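
The rule described above can be sketched roughly as follows. This is not the code from the patch: `same_model` is a hypothetical helper, and it assumes the model location is the directory joined with the model name, and that an empty draft model name means "no MTP".

```python
from pathlib import Path
from typing import Optional


def same_model(model_dir: str, model_name: str,
               draft_model_dir: str, draft_model_name: Optional[str]) -> bool:
    """Hypothetical helper: True when both settings resolve to one model folder."""
    # Assumption: a missing or empty draft model name never triggers MTP.
    if not draft_model_name:
        return False
    # resolve() normalizes "." and ".." so equivalent spellings compare equal.
    main = (Path(model_dir) / model_name).resolve()
    draft = (Path(draft_model_dir) / draft_model_name).resolve()
    return main == draft


# Same folder for both: the PR would switch to MTP in this case.
use_mtp = same_model("models", "Qwen3.6-27B-exl3", "models", "Qwen3.6-27B-exl3")
```

Comparing resolved paths rather than raw strings avoids a mismatch caused by a trailing slash or a relative path segment, though an explicit config option would sidestep the comparison entirely.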

Additional context
Tested with https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 (you have to download the safetensors file and put it in the model dir; I assume it will be included by default in future quants, where supported).

@randoentity

Contributor

I can't get this to work yet.

I'm on exllamav3 9c5009efaa2cda8ed341369123bb4acfe18ae300
tabbyAPI 2e50555 + patch-2

Using https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 and UnstableLlama_Qwen3.6-27B-exl3-8.00bpw
3 draft module layers get loaded but it raises an error on generation.

AI generated report below:

Bug Report: AttributeError during MTP Draft Model Generation

Description

When initiating a chat completion with Multi-Token Prediction (MTP) enabled via the ExLlamaV3 backend, the generation process crashes. The error indicates that a linear module's inner component (self.inner) is None when attempting to perform a forward pass during draft model iteration.

Steps to Reproduce

  1. Configure TabbyAPI with the ExLlamaV3 backend.
  2. Load a model/architecture that utilizes MTP or a draft model.
  3. Send a chat completion request to trigger streaming generation.
  4. Monitor the server logs.

Expected Behavior

The model should successfully iterate through draft tokens using MTP and stream the completion without crashing.

Actual Behavior

The server raises an AttributeError: 'NoneType' object has no attribute 'forward' and aborts the generation request.

Error Log & Traceback Analysis

Critical Error:

AttributeError: 'NoneType' object has no attribute 'forward'
File "exllamav3/exllamav3/modules/linear.py", line 426, in forward
    x = self.inner.forward(x, params, out_dtype)

Call Stack Highlights:

  1. tabbyAPI/backends/exllamav3/model.py initiates generate_gen.
  2. exllamav3/exllamav3/generator/generator.py calls iterate_draftmodel_mtp_gen.
  3. At generator.py:525, it attempts: batch_logits = self.model.modules[self.model.logit_layer_idx].forward(batch_state, params)
  4. The forward pass enters linear.py:426 where self.inner is unexpectedly None.

Potential Causes

  • The draft model's linear layers were not correctly initialized or loaded from the state dictionary.
  • Architecture mismatch between the loaded model weights and the ExLlamaV3 module definition for MTP layers.
  • Missing or corrupted weight tensors for the specific logit layer index used in MTP drafting.

Environment

  • Python Version: 3.13
  • Backend: ExLlamaV3
  • Application: TabbyAPI
  • Date: 2026-06-10

Note: This bug report was drafted with the assistance of AI based on the provided traceback log.

@suspicious-pineapple

Author

@randoentity
Can you share your config (for model and draft model) and the (non-summarized) stack trace?

@randoentity

randoentity commented Jun 11, 2026

Contributor
model:
  model_dir: models
  inline_model_loading: true
  use_dummy_models: true
  dummy_model_names: ["local"]
  model_name: UnstableLlama_Qwen3.6-27B-exl3-8.00bpw
  backend: exllamav3
  max_seq_len: 131072
  tensor_parallel: true
  tensor_parallel_backend: native
  cache_mode: 8,8
  cache_size: 131072
  max_batch_size: 1
  vision: true
  reasoning: true
  reasoning_start_token: "<think>"
  reasoning_end_token: "</think>"
  tool_format: qwen3_coder

draft_model:
  draft_model_dir: models
  draft_model_name: UnstableLlama_Qwen3.6-27B-exl3-8.0bpw # (or nothing)

I tried a bunch of draft model variations and locations.

2026-06-10 09:38:42.047 ERROR: Error during chat completion
2026-06-10 09:38:42.047 ERROR: 'NoneType' object has no attribute 'forward'
2026-06-10 09:38:42.050 ERROR: Error Traceback (most recent call last):
  File "tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 572, in stream_generate_chat_completion
    raise generation
  File "tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 395, in _chat_stream_collector
    async for generation in new_generation:
    ...<97 lines>...
    break
  File "tabbyAPI/backends/exllamav3/model.py", line 822, in stream_generate
    async for generation_chunk in self.generate_gen(
    ...<7 lines>...
    yield generation_chunk
  File "tabbyAPI/backends/exllamav3/model.py", line 1172, in generate_gen
    raise ex
  File "tabbyAPI/backends/exllamav3/model.py", line 1102, in generate_gen
    async for result in job:
    ...<50 lines>...
    break
  File "exllamav3/exllamav3/generator/async_generator.py", line 109, in __aiter__
    raise result
  File "exllamav3/exllamav3/generator/async_generator.py", line 27, in _run_iteration
    results = self.generator.iterate()
  File "tabbyAPI/.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "exllamav3/exllamav3/generator/generator.py", line 357, in iterate
    draft_tokens = self.iterate_draftmodel_mtp_gen(results)
  File "exllamav3/exllamav3/generator/generator.py", line 525, in iterate_draftmodel_mtp_gen
    batch_logits = self.model.modules[self.model.logit_layer_idx].forward(batch_state, params)
  File "exllamav3/exllamav3/modules/linear.py", line 426, in forward
    x = self.inner.forward(x, params, out_dtype)
        ^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'forward'

When I try without TP I get:

GPU assert: an illegal memory access was encountered exllamav3/exllamav3/exllamav3_ext/quant/coop_autotune.cu 406

@turboderp

Collaborator

There may still be issues with TP, but I've added MTP support now.

The illegal memory access happens when the MTP model and target model end up on different devices, but that's sorted in exllamav3==0.0.42 (released just now, and Tabby is updated).

To enable, set draft_mode: mtp in the draft config.
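
As a rough picture of where that setting would sit once the YAML config is parsed, the sketch below uses a plain Python dict. Placing `draft_mode` under the `draft_model` section is an assumption drawn from the phrase "the draft config" and from the config posted earlier in this thread; `mtp_requested` is a hypothetical helper, not Tabby code.

```python
def mtp_requested(config: dict) -> bool:
    """Hypothetical check: does the parsed config ask for MTP drafting?"""
    # Assumption: draft settings live under the "draft_model" section.
    draft_cfg = config.get("draft_model") or {}
    return str(draft_cfg.get("draft_mode", "")).lower() == "mtp"


# Parsed form of a config that sets draft_mode: mtp in the draft section.
config = {
    "model": {
        "model_dir": "models",
        "model_name": "UnstableLlama_Qwen3.6-27B-exl3-8.00bpw",
        "backend": "exllamav3",
    },
    "draft_model": {"draft_mode": "mtp"},
}

enabled = mtp_requested(config)
```

An explicit mode flag like this is the config-option route raised in the PR description, in place of inferring MTP from matching model paths.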
