Fix draft model ignoring draft_gpu_split on load #425
Merged
turboderp merged 1 commit on Jun 12, 2026
Conversation
The exllamav3 backend parses the user-configured draft_gpu_split into self.draft_gpu_split, but load_model_sync passed self.gpu_split (the main model's split) when loading the draft model, so the draft split was silently ignored. Use self.draft_gpu_split instead.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collaborator
👍
Is your pull request related to a problem? Please describe.
The exllamav3 backend reads the user-configured draft_gpu_split option into self.draft_gpu_split (backends/exllamav3/model.py:172), but self.draft_gpu_split is never actually used anywhere. When loading the draft model in load_model_sync, the code passes self.gpu_split, the main model's split, to draft_model.load_gen(). As a result, the user's draft_gpu_split setting is silently ignored and the draft model is loaded using the main model's GPU split instead.
Why should this change be made?
So that the draft_gpu_split config option actually takes effect. Users who want to place the draft model on a specific GPU or with a specific split currently have no working way to do so on the exllamav3 backend.
Examples
The fix is a one-line change: use the draft model's own split when loading the draft model.
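The behavior before and after the change can be illustrated with a small self-contained sketch. This is not the backend's code: StubModel, StubContainer, the use_per_device keyword, and the use_draft_split switch are all hypothetical names invented for the illustration, and the real load_gen() signature is not shown in this PR. Only the attribute names gpu_split and draft_gpu_split and the method names load_model_sync and load_gen are taken from the description.

```python
from typing import Iterator, List, Optional


class StubModel:
    """Hypothetical stand-in for a loadable model (not the exllamav3 class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.loaded_split: Optional[List[float]] = None

    def load_gen(self, use_per_device: List[float]) -> Iterator[int]:
        # Record the split this model was asked to use, then yield fake progress.
        # The keyword name `use_per_device` is an assumption for this sketch.
        self.loaded_split = list(use_per_device)
        yield from range(3)


class StubContainer:
    """Hypothetical container holding a main model and a draft model."""

    def __init__(self, gpu_split: List[float], draft_gpu_split: List[float]) -> None:
        self.gpu_split = gpu_split
        self.draft_gpu_split = draft_gpu_split
        self.model = StubModel("main")
        self.draft_model = StubModel("draft")

    def load_model_sync(self, use_draft_split: bool = True) -> None:
        # use_draft_split=False reproduces the old behavior (main split reused);
        # use_draft_split=True reproduces the fixed behavior.
        draft_split = self.draft_gpu_split if use_draft_split else self.gpu_split
        for _ in self.draft_model.load_gen(use_per_device=draft_split):
            pass
        # The main model always uses its own split.
        for _ in self.model.load_gen(use_per_device=self.gpu_split):
            pass


buggy = StubContainer(gpu_split=[20.0, 24.0], draft_gpu_split=[0.0, 8.0])
buggy.load_model_sync(use_draft_split=False)

fixed = StubContainer(gpu_split=[20.0, 24.0], draft_gpu_split=[0.0, 8.0])
fixed.load_model_sync()

print(buggy.draft_model.loaded_split)  # [20.0, 24.0]: draft split ignored
print(fixed.draft_model.loaded_split)  # [0.0, 8.0]: draft split honored
```

In the sketch, the only difference between the two runs is which attribute is handed to the draft model's load_gen(), which mirrors the one-line nature of the change.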
Additional context
Single-line change in backends/exllamav3/model.py. self.draft_gpu_split is already parsed and defaults to [] (matching the prior behavior when no draft split is configured), so this has no effect for users who don't set the option.