Update For Qwen3.5/Gemma4 Support #1356
Conversation
….5 support!) - Removed deprecated native func `llama_adapter_lora_free` and related managed method `LoraAdapter.Unload`
Great work on this, Martin! Thank you! I’ve done some testing on Windows with the
So, it seems the GPU implementation is working fine, but there may be an issue with the CPU implementation or the layer-partitioning logic.
GPU (RTX 3090): the Qwen3.5 model seems to be working.

No, I found a problem! All the code works fine on 0.26.0 for Qwen3-Embedding-0.6B-F16.gguf; with this PR's binaries it fails with `GGML_ASSERT(ggml_can_mul_mat(a, b)) failed`. That's the problem, but then how do embedders work? Full error log:

    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_init_from_model: model default pooling_type is [3], but [1] was specified
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: constructing llama_context
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_seq_max = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx_seq = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_batch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ubatch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: causal_attn = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: flash_attn = enabled
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: kv_unified = true
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_base = 1000000.0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_scale = 1
    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: set_abort_callback: call
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host output buffer size = 0.58 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 0: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 1: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 2: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 3: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 4: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 5: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 6: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 7: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 8: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 9: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 10: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 11: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 12: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 13: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 14: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 15: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 16: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 17: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 18: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 19: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 20: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 21: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 22: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 23: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 24: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 25: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 26: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 27: dev = CUDA0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 1/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: enumerating backends
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: backend_ptrs.size() = 2
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: reserving ...
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: max_nodes = 2488
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: reserving full memory module
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: worst-case: n_tokens = 256, n_seqs = 1, n_outputs = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: resolving fused Gated Delta Net support:
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: fused Gated Delta Net (autoregressive) enabled
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 16, n_seqs = 1, n_outputs = 1
    D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml.c:3214: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

This is the working log for LLamaSharp 0.26.0:

    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_init_from_model: model default pooling_type is [3], but [1] was specified
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: constructing llama_context
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_seq_max = 64
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx_seq = 1024
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_batch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ubatch = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: causal_attn = 1
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: flash_attn = enabled
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: kv_unified = true
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_base = 1000000.0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_scale = 1
    warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: set_abort_callback: call
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host output buffer size = 37.28 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 0: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 1: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 2: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 3: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 4: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 5: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 6: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 7: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 8: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 9: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 10: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 11: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 12: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 13: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 14: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 15: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 16: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 17: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 18: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 19: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 20: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 21: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 22: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 23: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 24: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 25: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 26: dev = CUDA0
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 27: dev = CUDA0
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 64/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: enumerating backends
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: backend_ptrs.size() = 2
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: max_nodes = 2488
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: reserving full memory module
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: worst-case: n_tokens = 256, n_seqs = 64, n_outputs = 64
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 64, n_outputs = 64
    dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA0 compute buffer size = 150.43 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host compute buffer size = 2.07 MiB
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: graph nodes = 990
    info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: graph splits = 2
Not sure if this helps, but I tested on macOS (M4 Pro) and it works fine.
Is this PR still being worked on?
When I get some time I'm going to try updating it with a more recent version of llama.cpp. I'm hoping that the errors people reported were an issue with the specific version this PR is based on.
I stopped using `LLamaPoolingType.Mean`.
Yes, `LLamaPoolingType.Mean` must no longer be used. It definitely causes a crash, and it's a tricky change!
With further testing on Windows + CUDA I get this strange error:
Context shifting is now banned outright. In my opinion that's bad, because in practice it is needed. I suggested a solution earlier.
@aropb, I'm not sure about your proposed solution. I looked at the problem, and the llama.cpp implementation (the bouncer) is just reading the model's blueprint: the engine sees that the model uses multi-dimensional positional embeddings and explicitly disables the shifting feature to protect the application from crashing or outputting gibberish. So the implementation is what throws the `MemoryCanShift == false` error, but it does so because the model's fundamental math cannot support the operation. With multi-dimensional positional embeddings (like 2D RoPE), you cannot simply "slide" positions without completely corrupting the mathematical relationships between the tokens... With Qwen3.5, memory shifting may simply not be possible.
Yeah, that sounds about right to me.
I think this is probably the only viable way to handle it. Shifting is basically just an optimisation to avoid re-running prefill over all of the tokens. So if shifting isn't supported, we should fall back to running prefill over the shifted tokens instead of crashing. That'll have to be a follow-up PR to this one. Would either of you be interested in developing it?
PR has been updated to a more recent version of llama.cpp |
My tests are running fine on:
Note: on some GPUs we need
Note (17th April): This PR has been updated to a newer binary version than when it was first opened!
Testing: