Skip to content

fix: architecture-aware n_embd read before vocab_only early return for mmproj init#22348

Open
ChenYFan wants to merge 6 commits into
ggml-org:masterfrom
ChenYFan:master
Open

fix: architecture-aware n_embd read before vocab_only early return for mmproj init#22348
ChenYFan wants to merge 6 commits into
ggml-org:masterfrom
ChenYFan:master

Conversation

@ChenYFan

@ChenYFan ChenYFan commented Apr 25, 2026

Copy link
Copy Markdown

Summary

Read LLM_KV_EMBEDDING_LENGTH and LLM_KV_EMBEDDING_LENGTH_OUT before the vocab_only/CLIP early return so llama_model_n_embd_inp() returns the correct value even with vocab_only=true. Without this, mmproj init fails with "mismatch between text model (n_embd=0) and mmproj (n_embd=5120)".

Problem

When loading a model with vocab_only=true and then initializing mmproj via mtmd_init_from_file, llama_model_n_embd_inp() returns 0 because hparams.n_embd is never populated — the vocab_only early return exits before LLM_KV_EMBEDDING_LENGTH is parsed.

The previous fix unconditionally read LLM_KV_EMBEDDING_LENGTH into hparams.n_embd for all architectures before the early return. This introduced two issues:

Incorrect semantics for special architectures — LLM_ARCH_WAVTOKENIZER_DEC intentionally sets n_embd from LM_KV_FEATURES_LENGTH and repurposes LLM_KV_EMBEDDING_LENGTH for n_embd_out_impl. In vocab_only mode the early return prevented these overrides, leaving hparams.n_embd with an incorrect value.
Duplicate metadata read — LLM_KV_EMBEDDING_LENGTH was read once before the early return and again unconditionally after it, resulting in redundant parsing for all non-vocab_only loads.

Changes

Architecture-aware guard: the pre-return read of LLM_KV_EMBEDDING_LENGTH is now gated on hparams.vocab_only && arch != LLM_ARCH_CLIP && arch != LLM_ARCH_WAVTOKENIZER_DEC, so it only fires for architectures where the key directly represents hparams.n_embd.
Also read LLM_KV_EMBEDDING_LENGTH_OUT (hparams.n_embd_out_impl, optional) in the same guard, since certain models need it even in vocab_only mode.
Use arch member variable instead of calling ml.get_arch() again in the CLIP early-return condition.
No duplicate reads: vocab_only path returns early after reading; non-vocab_only path reads only at its normal location.

Use case

Offline image embedding extraction using only the tokenizer + mmproj (no full model weights). Loading with vocab_only=true is ~6× faster (0.88s vs 5.4s) and uses no weight memory, while producing identical embeddings.

For a practical example of embedding decomposition and offline mmproj loading, see: https://gist.github.com/ChenYFan/ee5b0441e857b09135c3b2269c88f3a6

Requirements

I have read and agree with the contributing guidelines

Move LLM_KV_EMBEDDING_LENGTH read before the vocab_only/CLIP early
return so llama_model_n_embd_inp() returns the correct value even
with vocab_only=true. Without this, mmproj init fails with
"mismatch between text model (n_embd=0) and mmproj (n_embd=5120)".
Copilot AI review requested due to automatic review settings April 25, 2026 08:28
@ChenYFan ChenYFan requested a review from CISC as a code owner April 25, 2026 08:28
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 25, 2026

Copy link
Copy Markdown

Hi @ChenYFan, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts llama_model::load_hparams() so hparams.n_embd is available even when loading a model with vocab_only=true, which is needed for mtmd/mmproj initialization paths that query llama_model_n_embd_inp().

Changes:

  • Reads LLM_KV_EMBEDDING_LENGTH before the vocab_only / LLM_ARCH_CLIP early return in load_hparams().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/llama-model.cpp Outdated
Comment on lines +723 to +725
// read n_embd before vocab_only early return, needed by mmproj
ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);

Copilot AI Apr 25, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fixes a previously-uninitialized n_embd in vocab_only mode, but there doesn't appear to be a regression test that asserts llama_model_n_embd_inp() is non-zero after a vocab_only=true load. Adding a small assertion in an existing vocab-only test (e.g. tokenizer or llguidance tests) would prevent future regressions of the mmproj initialization issue.

Copilot uses AI. Check for mistakes.
Comment thread src/llama-model.cpp Outdated
Comment on lines 723 to 728
// read n_embd before vocab_only early return, needed by mmproj
ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);

// everything past this point is not vocab-related
// for CLIP models, we only need to load tensors, no hparams
if (hparams.vocab_only || ml.get_arch() == LLM_ARCH_CLIP) {

Copilot AI Apr 25, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This eagerly reads LLM_KV_EMBEDDING_LENGTH for all architectures, but some arch-specific paths later intentionally repurpose/override hparams.n_embd (e.g. LLM_ARCH_WAVTOKENIZER_DEC sets n_embd from LLM_KV_FEATURES_LENGTH and uses LLM_KV_EMBEDDING_LENGTH for n_embd_out_impl). In vocab_only mode the early return prevents those overrides, so hparams.n_embd can become incorrect for those architectures. Consider limiting this read to the vocab_only case and/or making it architecture-aware (e.g. skip/adjust for LLM_ARCH_CLIP and LLM_ARCH_WAVTOKENIZER_DEC) to avoid changing semantics outside the mmproj/text-model use case.

Suggested change
// read n_embd before vocab_only early return, needed by mmproj
ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);
// everything past this point is not vocab-related
// for CLIP models, we only need to load tensors, no hparams
if (hparams.vocab_only || ml.get_arch() == LLM_ARCH_CLIP) {
const auto arch = ml.get_arch();
// read n_embd before vocab_only early return only for architectures
// where LLM_KV_EMBEDDING_LENGTH directly represents hparams.n_embd
// (needed by mmproj/text-model vocab-only loading)
if (hparams.vocab_only &&
arch != LLM_ARCH_CLIP &&
arch != LLM_ARCH_WAVTOKENIZER_DEC) {
ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);
}
// everything past this point is not vocab-related
// for CLIP models, we only need to load tensors, no hparams
if (hparams.vocab_only || arch == LLM_ARCH_CLIP) {

Copilot uses AI. Check for mistakes.
@CISC

CISC commented Apr 25, 2026

Copy link
Copy Markdown
Member

In your use case you would also need to read LLM_KV_EMBEDDING_LENGTH_OUT for certain models, also as pointed out by Copilot may be a problem for wavtokenizer?

@CISC CISC requested a review from ngxson April 25, 2026 11:20

@ngxson ngxson left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the metadata is now loaded twice, I don't think it's the expected behavior

@ChenYFan

Copy link
Copy Markdown
Author

In your use case you would also need to read LLM_KV_EMBEDDING_LENGTH_OUT for certain models, also as pointed out by Copilot may be a problem for wavtokenizer?

Yes, I may not have taken into account that other models require loading additional fields. I originally designed this architecture specifically for Qwen 3.6's mmproj module; the primary objective was to enable the loading of mmproj solely by loading the vocabulary—without the need to load the entire model. In my project, this allows me to bridge the original embedding configurations, thereby enabling the generation of embeddings without having to load the main model.

@ChenYFan

Copy link
Copy Markdown
Author

the metadata is now loaded twice, I don't think it's the expected behavior

To avoid potential side effects, I decided to opt for the simplest approach: pre-loading the data and subsequently overwriting it. If you feel that this redundant loading is undesirable, I can attempt to eliminate the subsequent loading step. In other words, I would move the entire vocabulary loading process to the very beginning.

@ChenYFan ChenYFan changed the title fix: read n_embd before vocab_only early return for mmproj init fix: architecture-aware n_embd read before vocab_only early return for mmproj init Apr 29, 2026
@ChenYFan

Copy link
Copy Markdown
Author

@CISC @ngxson I have already modified the conditions and LLM_KV_EMBEDDING_LENGTH_OUT according to your suggestions. pls review it agian,tks! #4d962df

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants