fix: architecture-aware n_embd read before vocab_only early return for mmproj init#22348
fix: architecture-aware n_embd read before vocab_only early return for mmproj init#22348ChenYFan wants to merge 6 commits into
Conversation
Move LLM_KV_EMBEDDING_LENGTH read before the vocab_only/CLIP early return so llama_model_n_embd_inp() returns the correct value even with vocab_only=true. Without this, mmproj init fails with "mismatch between text model (n_embd=0) and mmproj (n_embd=5120)".
|
Hi @ChenYFan, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
There was a problem hiding this comment.
Pull request overview
Adjusts llama_model::load_hparams() so hparams.n_embd is available even when loading a model with vocab_only=true, which is needed for mtmd/mmproj initialization paths that query llama_model_n_embd_inp().
Changes:
- Reads
LLM_KV_EMBEDDING_LENGTHbefore thevocab_only/LLM_ARCH_CLIPearly return inload_hparams().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // read n_embd before vocab_only early return, needed by mmproj | ||
| ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false); | ||
|
|
There was a problem hiding this comment.
This change fixes a previously-uninitialized n_embd in vocab_only mode, but there doesn't appear to be a regression test that asserts llama_model_n_embd_inp() is non-zero after a vocab_only=true load. Adding a small assertion in an existing vocab-only test (e.g. tokenizer or llguidance tests) would prevent future regressions of the mmproj initialization issue.
| // read n_embd before vocab_only early return, needed by mmproj | ||
| ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false); | ||
|
|
||
| // everything past this point is not vocab-related | ||
| // for CLIP models, we only need to load tensors, no hparams | ||
| if (hparams.vocab_only || ml.get_arch() == LLM_ARCH_CLIP) { |
There was a problem hiding this comment.
This eagerly reads LLM_KV_EMBEDDING_LENGTH for all architectures, but some arch-specific paths later intentionally repurpose/override hparams.n_embd (e.g. LLM_ARCH_WAVTOKENIZER_DEC sets n_embd from LLM_KV_FEATURES_LENGTH and uses LLM_KV_EMBEDDING_LENGTH for n_embd_out_impl). In vocab_only mode the early return prevents those overrides, so hparams.n_embd can become incorrect for those architectures. Consider limiting this read to the vocab_only case and/or making it architecture-aware (e.g. skip/adjust for LLM_ARCH_CLIP and LLM_ARCH_WAVTOKENIZER_DEC) to avoid changing semantics outside the mmproj/text-model use case.
| // read n_embd before vocab_only early return, needed by mmproj | |
| ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false); | |
| // everything past this point is not vocab-related | |
| // for CLIP models, we only need to load tensors, no hparams | |
| if (hparams.vocab_only || ml.get_arch() == LLM_ARCH_CLIP) { | |
| const auto arch = ml.get_arch(); | |
| // read n_embd before vocab_only early return only for architectures | |
| // where LLM_KV_EMBEDDING_LENGTH directly represents hparams.n_embd | |
| // (needed by mmproj/text-model vocab-only loading) | |
| if (hparams.vocab_only && | |
| arch != LLM_ARCH_CLIP && | |
| arch != LLM_ARCH_WAVTOKENIZER_DEC) { | |
| ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false); | |
| } | |
| // everything past this point is not vocab-related | |
| // for CLIP models, we only need to load tensors, no hparams | |
| if (hparams.vocab_only || arch == LLM_ARCH_CLIP) { |
|
In your use case you would also need to read |
ngxson
left a comment
There was a problem hiding this comment.
the metadata is now loaded twice, I don't think it's the expected behavior
Yes, I may not have taken into account that other models require loading additional fields. I originally designed this architecture specifically for Qwen 3.6's mmproj module; the primary objective was to enable the loading of mmproj solely by loading the vocabulary—without the need to load the entire model. In my project, this allows me to bridge the original embedding configurations, thereby enabling the generation of embeddings without having to load the main model. |
To avoid potential side effects, I decided to opt for the simplest approach: pre-loading the data and subsequently overwriting it. If you feel that this redundant loading is undesirable, I can attempt to eliminate the subsequent loading step. In other words, I would move the entire vocabulary loading process to the very beginning. |
Summary
Read
LLM_KV_EMBEDDING_LENGTHandLLM_KV_EMBEDDING_LENGTH_OUTbefore the vocab_only/CLIP early return so llama_model_n_embd_inp() returns the correct value even with vocab_only=true. Without this, mmproj init fails with "mismatch between text model (n_embd=0) and mmproj (n_embd=5120)".Problem
When loading a model with
vocab_only=trueand then initializing mmproj via mtmd_init_from_file, llama_model_n_embd_inp() returns 0 because hparams.n_embd is never populated — the vocab_only early return exits beforeLLM_KV_EMBEDDING_LENGTHis parsed.The previous fix unconditionally read
LLM_KV_EMBEDDING_LENGTHinto hparams.n_embd for all architectures before the early return. This introduced two issues:Incorrect semantics for special architectures —
LLM_ARCH_WAVTOKENIZER_DECintentionally sets n_embd fromLM_KV_FEATURES_LENGTHand repurposesLLM_KV_EMBEDDING_LENGTHfor n_embd_out_impl. In vocab_only mode the early return prevented these overrides, leaving hparams.n_embd with an incorrect value.Duplicate metadata read —
LLM_KV_EMBEDDING_LENGTHwas read once before the early return and again unconditionally after it, resulting in redundant parsing for all non-vocab_only loads.Changes
Architecture-aware guard: the pre-return read of
LLM_KV_EMBEDDING_LENGTHis now gated onhparams.vocab_only && arch != LLM_ARCH_CLIP && arch != LLM_ARCH_WAVTOKENIZER_DEC, so it only fires for architectures where the key directly representshparams.n_embd.Also read
LLM_KV_EMBEDDING_LENGTH_OUT(hparams.n_embd_out_impl, optional) in the same guard, since certain models need it even in vocab_only mode.Use arch member variable instead of calling
ml.get_arch()again in the CLIP early-return condition.No duplicate reads: vocab_only path returns early after reading; non-vocab_only path reads only at its normal location.
Use case
Offline image embedding extraction using only the tokenizer + mmproj (no full model weights). Loading with
vocab_only=trueis ~6× faster (0.88s vs 5.4s) and uses no weight memory, while producing identical embeddings.For a practical example of embedding decomposition and offline mmproj loading, see: https://gist.github.com/ChenYFan/ee5b0441e857b09135c3b2269c88f3a6
Requirements
I have read and agree with the contributing guidelines