use MTP if model_dir and draft_model_dir are equal by suspicious-pineapple · Pull Request #424 · theroyallab/tabbyAPI

suspicious-pineapple · 2026-06-09T12:13:18Z

Why should this feature be added?
This seems to be the minimal set of changes needed to make MTP work on the latest exl3 dev branch.

Examples
MTP is enabled if the main model is the same as the draft model; otherwise it behaves normally.
Maybe this would be more sanely exposed as a config option?
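A minimal sketch of the check described above, assuming "the same model" means the main and draft paths resolve to the same directory. The helper names `resolve_model_path` and `should_use_mtp` are hypothetical and are not taken from the tabbyAPI code base; the actual patch may compare the values differently.

```python
from pathlib import Path
from typing import Optional


def resolve_model_path(model_dir: str, model_name: Optional[str]) -> Optional[Path]:
    """Join directory and model name into an absolute path (hypothetical helper)."""
    if not model_name:
        return None
    # resolve() normalizes "./models/" and "models" to the same absolute path
    return (Path(model_dir) / model_name).resolve()


def should_use_mtp(
    model_dir: str,
    model_name: Optional[str],
    draft_model_dir: str,
    draft_model_name: Optional[str],
) -> bool:
    """Return True when the draft model points at the main model itself."""
    main = resolve_model_path(model_dir, model_name)
    draft = resolve_model_path(draft_model_dir, draft_model_name)
    return main is not None and main == draft
```

Under this assumption, a draft model configured with a different name or with no name at all falls through to the normal draft-model path.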

Additional context
Tested with https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 (you have to download the safetensors file and put it in the model dir; I assume it will be included by default in future quants, where supported).


randoentity · 2026-06-10T10:01:03Z

I can't get this to work yet.

I'm on exllamav3 9c5009efaa2cda8ed341369123bb4acfe18ae300
tabbyAPI 2e50555 + patch-2

Using https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 and UnstableLlama_Qwen3.6-27B-exl3-8.00bpw
3 draft module layers get loaded but it raises an error on generation.

AI generated report below:

Bug Report: AttributeError during MTP Draft Model Generation

Description

When initiating a chat completion with Multi-Token Prediction (MTP) enabled via the ExLlamaV3 backend, the generation process crashes. The error indicates that a linear module's inner component (self.inner) is None when attempting to perform a forward pass during draft model iteration.

Steps to Reproduce

Configure TabbyAPI with the ExLlamaV3 backend.
Load a model/architecture that utilizes MTP or a draft model.
Send a chat completion request to trigger streaming generation.
Monitor the server logs.

Expected Behavior

The model should successfully iterate through draft tokens using MTP and stream the completion without crashing.

Actual Behavior

The server raises an AttributeError: 'NoneType' object has no attribute 'forward' and aborts the generation request.

Error Log & Traceback Analysis

Critical Error:

AttributeError: 'NoneType' object has no attribute 'forward'
File "exllamav3/exllamav3/modules/linear.py", line 426, in forward
    x = self.inner.forward(x, params, out_dtype)

Call Stack Highlights:

tabbyAPI/backends/exllamav3/model.py initiates generate_gen.
exllamav3/exllamav3/generator/generator.py calls iterate_draftmodel_mtp_gen.
At generator.py:525, it attempts: batch_logits = self.model.modules[self.model.logit_layer_idx].forward(batch_state, params)
The forward pass enters linear.py:426 where self.inner is unexpectedly None.

Potential Causes

The draft model's linear layers were not correctly initialized or loaded from the state dictionary.
Architecture mismatch between the loaded model weights and the ExLlamaV3 module definition for MTP layers.
Missing or corrupted weight tensors for the specific logit layer index used in MTP drafting.
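The failure mode in the traceback can be illustrated with a stand-in, assuming the wrapper delegates to an inner object that is only set once weights are loaded. `InnerLinear`, `LinearWrapper` and `find_unloaded` are hypothetical names for illustration; this is not the exllamav3 implementation.

```python
class InnerLinear:
    """Stand-in for a loaded linear layer (hypothetical)."""

    def forward(self, x):
        return [2 * v for v in x]


class LinearWrapper:
    """Stand-in wrapper whose inner layer stays None until weights are loaded."""

    def __init__(self, inner=None):
        self.inner = inner

    def forward(self, x):
        # Raises AttributeError when self.inner is None, as in the report
        return self.inner.forward(x)


def find_unloaded(modules):
    """Return indices of modules that have an `inner` attribute set to None."""
    return [
        i for i, m in enumerate(modules)
        if hasattr(m, "inner") and m.inner is None
    ]
```

A check like `find_unloaded` run before generation would point at the module index that was never loaded instead of failing mid-request.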

Environment

Python Version: 3.13
Backend: ExLlamaV3
Application: TabbyAPI
Date: 2026-06-10

Note: This bug report was drafted with the assistance of AI based on the provided traceback log.

suspicious-pineapple · 2026-06-11T11:32:13Z

@randoentity
Can you share your config (for model and draft model) and the (non-summarized) stack trace?

randoentity · 2026-06-11T17:22:36Z

model:
  model_dir: models
  inline_model_loading: true
  use_dummy_models: true
  dummy_model_names: ["local"]
  model_name: UnstableLlama_Qwen3.6-27B-exl3-8.00bpw
  backend: exllamav3
  max_seq_len: 131072
  tensor_parallel: true
  tensor_parallel_backend: native
  cache_mode: 8,8
  cache_size: 131072
  max_batch_size: 1
  vision: true
  reasoning: true
  reasoning_start_token: "<think>"
  reasoning_end_token: "</think>"
  tool_format: qwen3_coder

draft_model:
  draft_model_dir: models
  draft_model_name: UnstableLlama_Qwen3.6-27B-exl3-8.0bpw # (or nothing)

I tried a bunch of draft model variations and locations.

2026-06-10 09:38:42.047 ERROR:
Error during chat completion

2026-06-10 09:38:42.047 ERROR:
'NoneType' object has no attribute 'forward'

2026-06-10 09:38:42.050 ERROR:
Error Traceback (most recent call last):

2026-06-10 09:38:42.050 ERROR:
File "tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 572, in stream_generate_chat_completion

2026-06-10 09:38:42.050 ERROR:
raise generation

2026-06-10 09:38:42.050 ERROR:
File "tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 395, in _chat_stream_collector

2026-06-10 09:38:42.050 ERROR:
async for generation in new_generation:

2026-06-10 09:38:42.050 ERROR:
...<97 lines>...

2026-06-10 09:38:42.050 ERROR:
break

2026-06-10 09:38:42.050 ERROR:
File "tabbyAPI/backends/exllamav3/model.py", line 822, in stream_generate

2026-06-10 09:38:42.050 ERROR:
async for generation_chunk in self.generate_gen(

2026-06-10 09:38:42.050 ERROR:
...<7 lines>...

2026-06-10 09:38:42.050 ERROR:
yield generation_chunk

2026-06-10 09:38:42.050 ERROR:
File "tabbyAPI/backends/exllamav3/model.py", line 1172, in generate_gen

2026-06-10 09:38:42.050 ERROR:
raise ex

2026-06-10 09:38:42.050 ERROR:
File "tabbyAPI/backends/exllamav3/model.py", line 1102, in generate_gen

2026-06-10 09:38:42.050 ERROR:
async for result in job:

2026-06-10 09:38:42.050 ERROR:
...<50 lines>...

2026-06-10 09:38:42.050 ERROR:
break

2026-06-10 09:38:42.050 ERROR:
File "exllamav3/exllamav3/generator/async_generator.py", line 109, in __aiter__

2026-06-10 09:38:42.050 ERROR:
raise result

2026-06-10 09:38:42.050 ERROR:
File "exllamav3/exllamav3/generator/async_generator.py", line 27, in _run_iteration

2026-06-10 09:38:42.050 ERROR:
results = self.generator.iterate()

2026-06-10 09:38:42.050 ERROR:
File "tabbyAPI/.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context

2026-06-10 09:38:42.050 ERROR:
return func(*args, **kwargs)

2026-06-10 09:38:42.050 ERROR:
File "exllamav3/exllamav3/generator/generator.py", line 357, in iterate

2026-06-10 09:38:42.050 ERROR:
draft_tokens = self.iterate_draftmodel_mtp_gen(results)

2026-06-10 09:38:42.050 ERROR:
File "exllamav3/exllamav3/generator/generator.py", line 525, in iterate_draftmodel_mtp_gen

2026-06-10 09:38:42.050 ERROR:
batch_logits = self.model.modules[self.model.logit_layer_idx].forward(batch_state, params)

2026-06-10 09:38:42.050 ERROR:
File "exllamav3/exllamav3/modules/linear.py", line 426, in forward

2026-06-10 09:38:42.050 ERROR:
x = self.inner.forward(x, params, out_dtype)

2026-06-10 09:38:42.050 ERROR:
^^^^^^^^^^^^^^^^^^

2026-06-10 09:38:42.050 ERROR:
AttributeError: 'NoneType' object has no attribute 'forward'

When I try without TP I get:

GPU assert: an illegal memory access was encountered exllamav3/exllamav3/exllamav3_ext/quant/coop_autotune.cu 406

turboderp · 2026-06-12T23:45:39Z

There may still be issues with TP, but I've added MTP support now.

The illegal memory access happens when the MTP model and target model end up on different devices, but that's sorted in exllamav3==0.0.42 (released just now, and Tabby is updated)

To enable, set draft_mode: mtp in the draft config.
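A small preflight sketch for the two conditions mentioned above: an exllamav3 release of at least 0.0.42 and `draft_mode: mtp` in the draft config. It assumes the config has already been parsed into a dict and that `draft_mode` lives under the `draft_model` section; both are assumptions, and the helper names are hypothetical. The version parser is deliberately simple and ignores pre-release suffixes.

```python
from importlib import metadata
from typing import Optional

MIN_MTP_VERSION = (0, 0, 42)  # release named in the comment above


def parse_version(text: str) -> tuple:
    """Turn '0.0.42' into (0, 0, 42); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_mtp(installed: str) -> bool:
    """True when the given version string is at least 0.0.42."""
    return parse_version(installed) >= MIN_MTP_VERSION


def installed_exllamav3_version() -> Optional[str]:
    """Return the installed exllamav3 version, or None if it is not installed."""
    try:
        return metadata.version("exllamav3")
    except metadata.PackageNotFoundError:
        return None


def mtp_requested(config: dict) -> bool:
    """True when the draft section asks for MTP (key location is an assumption)."""
    draft = config.get("draft_model") or {}
    return draft.get("draft_mode") == "mtp"
```

For example, `supports_mtp(installed_exllamav3_version() or "0")` combined with `mtp_requested(config)` gives a quick yes/no before loading the model.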

use MTP if model_dir and draft_model_dir are equal

0b2a3cc

this seems to be the minimal set of changes needed to make MTP work tested with <https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3>

randoentity mentioned this pull request Jun 12, 2026

Fix mtp tp turboderp-org/exllamav3#224

Open


suspicious-pineapple wants to merge 1 commit into theroyallab:main from suspicious-pineapple:patch-2
