Update For Qwen3.5/Gemma4 Support #1356
Conversation
….5 support!) - Removed deprecated native func `llama_adapter_lora_free` and related managed method `LoraAdapter.Unload`
Great work on this, Martin! Thank you! I’ve done some testing on Windows with the
So, it seems the GPU implementation is working fine, but there may be an issue with the CPU implementation or the layer-partitioning logic.
GPU (RTX 3090): the Qwen3.5 model seems to be working.

No, I found a problem! All the code works fine on 0.26.0 for Qwen3-Embedding-0.6B-F16.gguf; with this PR's binaries it fails with `GGML_ASSERT(ggml_can_mul_mat(a, b)) failed`. That's the problem, but then how do embedders work? Full error log:

    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_init_from_model: model default pooling_type is [3], but [1] was specified
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: constructing llama_context
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_seq_max = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx_seq = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_batch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ubatch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: causal_attn = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: flash_attn = enabled
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: kv_unified = true
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_base = 1000000.0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_scale = 1
    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: set_abort_callback: call
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host output buffer size = 0.58 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 0: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 1: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 2: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 3: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 4: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 5: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 6: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 7: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 8: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 9: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 10: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 11: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 12: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 13: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 14: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 15: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 16: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 17: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 18: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 19: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 20: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 21: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 22: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 23: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 24: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 25: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 26: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 27: dev = CUDA0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 1/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: enumerating backends
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: backend_ptrs.size() = 2
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: reserving ...
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: max_nodes = 2488
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: reserving full memory module
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: worst-case: n_tokens = 256, n_seqs = 1, n_outputs = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: resolving fused Gated Delta Net support:
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: fused Gated Delta Net (autoregressive) enabled
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 16, n_seqs = 1, n_outputs = 1
    D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml.c:3214: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

This is the working log for LLamaSharp 0.26.0:

    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_init_from_model: model default pooling_type is [3], but [1] was specified
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: constructing llama_context
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_seq_max = 64
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx_seq = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_batch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ubatch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: causal_attn = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: flash_attn = enabled
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: kv_unified = true
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_base = 1000000.0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_scale = 1
    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: set_abort_callback: call
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host output buffer size = 37.28 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 0: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 1: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 2: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 3: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 4: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 5: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 6: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 7: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 8: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 9: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 10: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 11: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 12: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 13: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 14: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 15: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 16: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 17: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 18: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 19: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 20: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 21: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 22: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 23: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 24: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 25: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 26: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 27: dev = CUDA0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 64/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: enumerating backends
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: backend_ptrs.size() = 2
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: max_nodes = 2488
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: reserving full memory module
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: worst-case: n_tokens = 256, n_seqs = 64, n_outputs = 64
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 64, n_outputs = 64
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA0 compute buffer size = 150.43 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host compute buffer size = 2.07 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: graph nodes = 990
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: graph splits = 2
Not sure if this helps, but I tested on macOS (M4 Pro) and it works fine.
Is this PR still being worked on?
When I get some time I'm going to try updating it with a more recent version of llama.cpp. I'm hoping that the errors people reported were an issue with the specific version this PR is based on.
I stopped using `LLamaPoolingType.Mean`.
Yes, `LLamaPoolingType.Mean` must no longer be used. It definitely causes a crash, and it's a tricky change!
With further testing on Windows + CUDA I get this strange error:
Context shifting is now banned outright. In my opinion that's bad, because in practice it is needed. I suggested a solution earlier.
@aropb, I'm not sure about your proposed solution. I looked at the problem, and the llama.cpp implementation (the bouncer) is just reading the model's blueprint: the engine sees that the model uses multi-dimensional positional embeddings and explicitly disables the shifting feature to protect the application from crashing or outputting gibberish. So the implementation is what throws the `MemoryCanShift == false` error, but it does so because the model's fundamental math cannot support the operation. With multi-dimensional positional embeddings (like 2D RoPE), you cannot simply "slide" positions without completely corrupting the mathematical relationships between the tokens... With Qwen3.5, memory shifting may simply not be possible.
Yeah, that sounds about right to me.
I think this is probably the only viable way to handle it. Shifting is basically just an optimisation to avoid re-running prefill over all of the tokens. So if shifting isn't supported, we should fall back to running prefill over the shifted tokens instead of crashing. That'll have to be a follow-up PR to this one. Would either of you be interested in developing it?
PR has been updated to a more recent version of llama.cpp |
My tests are running fine on:
Note: on some GPUs we need
Note (17th April): This PR has been updated to a newer binary version than when it was first opened!
Testing: