fix: architecture-aware n_embd read before vocab_only early return for mmproj init by ChenYFan · Pull Request #22348 · ggml-org/llama.cpp

ChenYFan · 2026-04-25T08:28:49Z

Summary

Read LLM_KV_EMBEDDING_LENGTH and LLM_KV_EMBEDDING_LENGTH_OUT before the vocab_only/CLIP early return so llama_model_n_embd_inp() returns the correct value even with vocab_only=true. Without this, mmproj init fails with "mismatch between text model (n_embd=0) and mmproj (n_embd=5120)".

Problem

When loading a model with vocab_only=true and then initializing mmproj via mtmd_init_from_file, llama_model_n_embd_inp() returns 0 because hparams.n_embd is never populated — the vocab_only early return exits before LLM_KV_EMBEDDING_LENGTH is parsed.

The previous fix unconditionally read LLM_KV_EMBEDDING_LENGTH into hparams.n_embd for all architectures before the early return. This introduced two issues:

Incorrect semantics for special architectures: LLM_ARCH_WAVTOKENIZER_DEC intentionally sets n_embd from LLM_KV_FEATURES_LENGTH and repurposes LLM_KV_EMBEDDING_LENGTH for n_embd_out_impl. In vocab_only mode the early return prevented these overrides, leaving hparams.n_embd with an incorrect value.
Duplicate metadata read: LLM_KV_EMBEDDING_LENGTH was read once before the early return and again unconditionally after it, resulting in redundant parsing for all non-vocab_only loads.

Changes

Architecture-aware guard: the pre-return read of LLM_KV_EMBEDDING_LENGTH is now gated on hparams.vocab_only && arch != LLM_ARCH_CLIP && arch != LLM_ARCH_WAVTOKENIZER_DEC, so it only fires for architectures where the key directly represents hparams.n_embd.
Also read LLM_KV_EMBEDDING_LENGTH_OUT (hparams.n_embd_out_impl, optional) in the same guard, since certain models need it even in vocab_only mode.
Use arch member variable instead of calling ml.get_arch() again in the CLIP early-return condition.
No duplicate reads: vocab_only path returns early after reading; non-vocab_only path reads only at its normal location.

Use case

Offline image embedding extraction using only the tokenizer + mmproj (no full model weights). Loading with vocab_only=true is ~6× faster (0.88s vs 5.4s) and uses no weight memory, while producing identical embeddings.

For a practical example of embedding decomposition and offline mmproj loading, see: https://gist.github.com/ChenYFan/ee5b0441e857b09135c3b2269c88f3a6

Requirements

I have read and agree with the contributing guidelines

Move LLM_KV_EMBEDDING_LENGTH read before the vocab_only/CLIP early return so llama_model_n_embd_inp() returns the correct value even with vocab_only=true. Without this, mmproj init fails with "mismatch between text model (n_embd=0) and mmproj (n_embd=5120)".

ggml-gh-bot · 2026-04-25T08:32:52Z

Hi @ChenYFan, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Copilot

Pull request overview

Adjusts llama_model::load_hparams() so hparams.n_embd is available even when loading a model with vocab_only=true, which is needed for mtmd/mmproj initialization paths that query llama_model_n_embd_inp().

Changes:

Reads LLM_KV_EMBEDDING_LENGTH before the vocab_only / LLM_ARCH_CLIP early return in load_hparams().

Copilot · 2026-04-25T08:34:05Z

+    // read n_embd before vocab_only early return, needed by mmproj
+    ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);
+


This change fixes a previously-uninitialized n_embd in vocab_only mode, but there doesn't appear to be a regression test that asserts llama_model_n_embd_inp() is non-zero after a vocab_only=true load. Adding a small assertion in an existing vocab-only test (e.g. tokenizer or llguidance tests) would prevent future regressions of the mmproj initialization issue.

Copilot · 2026-04-25T08:34:05Z

+    // read n_embd before vocab_only early return, needed by mmproj
+    ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);
+
    // everything past this point is not vocab-related
    // for CLIP models, we only need to load tensors, no hparams
    if (hparams.vocab_only || ml.get_arch() == LLM_ARCH_CLIP) {


This eagerly reads LLM_KV_EMBEDDING_LENGTH for all architectures, but some arch-specific paths later intentionally repurpose/override hparams.n_embd (e.g. LLM_ARCH_WAVTOKENIZER_DEC sets n_embd from LLM_KV_FEATURES_LENGTH and uses LLM_KV_EMBEDDING_LENGTH for n_embd_out_impl). In vocab_only mode the early return prevents those overrides, so hparams.n_embd can become incorrect for those architectures. Consider limiting this read to the vocab_only case and/or making it architecture-aware (e.g. skip/adjust for LLM_ARCH_CLIP and LLM_ARCH_WAVTOKENIZER_DEC) to avoid changing semantics outside the mmproj/text-model use case.

Suggested change

-    // read n_embd before vocab_only early return, needed by mmproj
-    ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);
-    // everything past this point is not vocab-related
-    // for CLIP models, we only need to load tensors, no hparams
-    if (hparams.vocab_only || ml.get_arch() == LLM_ARCH_CLIP) {
+    const auto arch = ml.get_arch();
+    // read n_embd before vocab_only early return only for architectures
+    // where LLM_KV_EMBEDDING_LENGTH directly represents hparams.n_embd
+    // (needed by mmproj/text-model vocab-only loading)
+    if (hparams.vocab_only &&
+        arch != LLM_ARCH_CLIP &&
+        arch != LLM_ARCH_WAVTOKENIZER_DEC) {
+        ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd, false);
+    }
+    // everything past this point is not vocab-related
+    // for CLIP models, we only need to load tensors, no hparams
+    if (hparams.vocab_only || arch == LLM_ARCH_CLIP) {

CISC · 2026-04-25T11:20:21Z

In your use case you would also need to read LLM_KV_EMBEDDING_LENGTH_OUT for certain models. Also, as pointed out by Copilot, this may be a problem for wavtokenizer?

ngxson

the metadata is now loaded twice, I don't think it's the expected behavior

ChenYFan · 2026-04-29T05:31:39Z

In your use case you would also need to read LLM_KV_EMBEDDING_LENGTH_OUT for certain models. Also, as pointed out by Copilot, this may be a problem for wavtokenizer?

Yes, I may not have taken into account that other models require loading additional fields. I originally designed this architecture specifically for Qwen 3.6's mmproj module; the primary objective was to enable the loading of mmproj solely by loading the vocabulary—without the need to load the entire model. In my project, this allows me to bridge the original embedding configurations, thereby enabling the generation of embeddings without having to load the main model.

ChenYFan · 2026-04-29T05:33:38Z

the metadata is now loaded twice, I don't think it's the expected behavior

To avoid potential side effects, I decided to opt for the simplest approach: pre-loading the data and subsequently overwriting it. If you feel that this redundant loading is undesirable, I can attempt to eliminate the subsequent loading step. In other words, I would move the entire vocabulary loading process to the very beginning.

ChenYFan · 2026-04-29T06:24:04Z

@CISC @ngxson I have modified the condition and added the LLM_KV_EMBEDDING_LENGTH_OUT read according to your suggestions. Please review it again, thanks! #4d962df

Copilot AI review requested due to automatic review settings April 25, 2026 08:28

ChenYFan requested a review from CISC as a code owner April 25, 2026 08:28

Copilot started reviewing on behalf of ChenYFan April 25, 2026 08:29

Copilot AI reviewed Apr 25, 2026

CISC requested a review from ngxson April 25, 2026 11:20

ngxson requested changes Apr 25, 2026

ChenYFan changed the title ~~fix: read n_embd before vocab_only early return for mmproj init~~ fix: architecture-aware n_embd read before vocab_only early return for mmproj init Apr 29, 2026

ChenYFan and others added 2 commits April 29, 2026 14:14

Merge branch 'ggml-org:master' into master

fa41210

fix: it only fires for architectures where the key directly represent…

f5b8467

…s hparams.n_embd

ChenYFan added 3 commits May 9, 2026 02:19

Merge branch 'ggml-org:master' into master

8c0f220

Merge branch 'ggml-org:master' into master

f1e24d0

Merge branch 'ggml-org:master' into master

dd9fa3a

ChenYFan wants to merge 6 commits into ggml-org:master from ChenYFan:master

4 participants
