rocm: fix gfx1151 correctness (precise expert-selection math), enable indexer, enable agent by alantsev · Pull Request #311 · antirez/ds4

alantsev · 2026-05-31T14:48:01Z

ROCm: fix the correctness issues on gfx1151

(rebased on upstream main)

- enable the indexer
- use precise math functions on the expert-selection path
- use a warp sum to improve generation speed (8+ t/s -> 11+ t/s)
- enable the agent for the ROCm build
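
A rough illustration of why the expert-selection path is sensitive to math precision: top-k routing is a discrete choice, so a tiny error in two nearly tied scores can swap which expert is picked. This is only a sketch under stated assumptions. The sigmoid gate, the `approx_sigmoid` quantization (a stand-in for a fast-math approximation), and the sample logits are all hypothetical and are not taken from the ds4 kernels.

```python
import math

def top_k(scores, k):
    """Indices of the k largest scores; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def precise_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def approx_sigmoid(x, step=1e-3):
    """Hypothetical stand-in for a fast-math path: result snapped to a coarse grid."""
    return round(precise_sigmoid(x) / step) * step

# Hypothetical router logits; experts 0 and 1 are nearly tied.
logits = [0.2000, 0.2010, -1.0, 1.5]

precise_pick = top_k([precise_sigmoid(x) for x in logits], 2)
approx_pick = top_k([approx_sigmoid(x) for x in logits], 2)

print("precise:", precise_pick)
print("approx: ", approx_pick)
```

With these inputs the precise path keeps expert 1 while the coarse path collapses the two scores into a tie and falls back to expert 0, so the two paths route the token through different experts.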

the test run

$ ./ds4_test

long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.451s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
local-golden-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
ds4-test: local golden long_story_4096 top1 ref=4371 cand=4371 top5_overlap=4/5 top20_overlap=17/20 top64_overlap=55/64 top20_max_abs=1.67544
local-golden-vectors: OK
metal-short-prefill:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
metal-short-prefill: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
metal-tensor-equivalence:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence candidate route=auto
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-15.0155 cand=-15.0155 abs=0 id=1 ref=19.9358 cand=19.9358 abs=0 id=2 ref=-55.9084 cand=-55.9084 abs=0 id=3 ref=17.8982 cand=17.8982 abs=0 id=4 ref=26.0747 cand=26.0747 abs=0
ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=-2.66161 cand=-2.66161 abs=0 id=1 ref=21.3162 cand=21.3162 abs=0 id=2 ref=-45.7824 cand=-45.7824 abs=0 id=3 ref=10.9651 cand=10.9651 abs=0 id=4 ref=25.8229 cand=25.8229 abs=0
ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-3.03494 cand=-3.03494 abs=0 id=1 ref=23.3849 cand=23.3849 abs=0 id=2 ref=-42.7991 cand=-42.7991 abs=0 id=3 ref=16.0927 cand=16.0927 abs=0 id=4 ref=18.5051 cand=18.5051 abs=0
ds4-test: Tensor equivalence long_memory_archive top1 ref=32111 cand=32111 top5_overlap=4/5 overlap=18/20 max_rank_delta=4 rms=0.582494 max_abs=3.37462 top20_max_abs=1.52215
ds4-test: Tensor equivalence long_memory_archive largest deltas: id=126759 ref=-20.0995 cand=-23.4742 abs=3.37462 id=25690 ref=-20.7083 cand=-17.7369 abs=2.97143 id=61664 ref=-2.01424 cand=0.917232 abs=2.93147 id=78827 ref=-22.8887 cand=-25.7399 abs=2.8512 id=71753 ref=4.75498 cand=7.58581 abs=2.83083
ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=5/5 overlap=18/20 max_rank_delta=3 rms=0.433541 max_abs=2.06283 top20_max_abs=1.08788
ds4-test: Tensor equivalence long_code_audit largest deltas: id=113847 ref=-11.6684 cand=-13.7312 abs=2.06283 id=123327 ref=-16.5266 cand=-18.47 abs=1.94345 id=113254 ref=-2.13925 cand=-4.08151 abs=1.94226 id=103699 ref=-0.164321 cand=-2.08601 abs=1.92169 id=56957 ref=-6.75335 cand=-4.83732 abs=1.91603
ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=4/5 min_overlap=18/20 worst_rank_delta=4 worst_rms=0.582494 worst_max_abs=3.37462 worst_top20_max_abs=1.52215
metal-tensor-equivalence: OK
server:
server: OK
ds4 tests: ok
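
For readers unfamiliar with the tensor-equivalence fields in the log above, here is a minimal sketch of how metrics like `top1`, `top5_overlap`, `rms`, and `max_abs` could be computed from a reference and a candidate logit vector. It assumes the fields mean what their names suggest; the exact definitions used by ds4_test are not shown in this thread, and `compare_logits` and the sample vectors are hypothetical.

```python
import math

def top_ids(logits, k):
    """Token ids of the k largest logits; ties go to the lower id."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

def compare_logits(ref, cand, k=5):
    """Compare two logit vectors of equal length."""
    diffs = [abs(r - c) for r, c in zip(ref, cand)]
    ref_top, cand_top = top_ids(ref, k), top_ids(cand, k)
    return {
        "top1_match": ref_top[0] == cand_top[0],
        "topk_overlap": len(set(ref_top) & set(cand_top)),
        "rms": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "max_abs": max(diffs),
    }

# Hypothetical six-token vocabulary; only token 3 differs.
ref = [1.0, 4.0, 3.0, 2.0, 0.0, -1.0]
cand = [1.0, 4.0, 3.0, 0.5, 0.0, -1.0]
report = compare_logits(ref, cand, k=3)
print(report)
```

In this toy case the top-1 token still matches while one of the top-3 entries changes, which is the same pattern the long_memory_archive and long_code_audit rows show (top1 equal, overlap slightly below full).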

the deterministic test run (with DS4_CUDA_MOE_NO_ATOMIC_DOWN=1)

$ DS4_CUDA_MOE_NO_ATOMIC_DOWN=1 ./ds4_test

long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.315s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
local-golden-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
ds4-test: local golden long_story_4096 top1 ref=4371 cand=4371 top5_overlap=4/5 top20_overlap=16/20 top64_overlap=52/64 top20_max_abs=2.3658
local-golden-vectors: OK
metal-short-prefill:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
metal-short-prefill: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
metal-tensor-equivalence:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence candidate route=auto
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-15.0155 cand=-15.0155 abs=0 id=1 ref=19.9358 cand=19.9358 abs=0 id=2 ref=-55.9084 cand=-55.9084 abs=0 id=3 ref=17.8982 cand=17.8982 abs=0 id=4 ref=26.0747 cand=26.0747 abs=0
ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=-2.66161 cand=-2.66161 abs=0 id=1 ref=21.3162 cand=21.3162 abs=0 id=2 ref=-45.7824 cand=-45.7824 abs=0 id=3 ref=10.9651 cand=10.9651 abs=0 id=4 ref=25.8229 cand=25.8229 abs=0
ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-3.03494 cand=-3.03494 abs=0 id=1 ref=23.3849 cand=23.3849 abs=0 id=2 ref=-42.7991 cand=-42.7991 abs=0 id=3 ref=16.0927 cand=16.0927 abs=0 id=4 ref=18.5051 cand=18.5051 abs=0
ds4-test: Tensor equivalence long_memory_archive top1 ref=32111 cand=32111 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence long_memory_archive largest deltas: id=0 ref=-8.42831 cand=-8.42831 abs=0 id=1 ref=19.284 cand=19.284 abs=0 id=2 ref=-50.653 cand=-50.653 abs=0 id=3 ref=10.6968 cand=10.6968 abs=0 id=4 ref=21.0302 cand=21.0302 abs=0
ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence long_code_audit largest deltas: id=0 ref=-4.50487 cand=-4.50487 abs=0 id=1 ref=19.7669 cand=19.7669 abs=0 id=2 ref=-47.0626 cand=-47.0626 abs=0 id=3 ref=16.7405 cand=16.7405 abs=0 id=4 ref=23.0197 cand=23.0197 abs=0
ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=5/5 min_overlap=20/20 worst_rank_delta=0 worst_rms=0 worst_max_abs=0 worst_top20_max_abs=0
metal-tensor-equivalence: OK
server:
server: OK
ds4 tests: ok
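
A likely reason the long-prompt rows drop to rms=0 in this run is that atomic accumulation adds partial sums in whatever order the threads happen to arrive, and floating-point addition is not associative. The sketch below shows the effect with float32 rounding emulated on the CPU. It is an assumption about the mechanism behind DS4_CUDA_MOE_NO_ATOMIC_DOWN, not a description of the kernel; `f32` and `sum_f32` are hypothetical helpers.

```python
import struct

def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate left to right, rounding to float32 after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc

# The same four addends in two different arrival orders.
in_order = sum_f32([1e8, 1.0, -1e8, 1.0])
reordered = sum_f32([1e8, -1e8, 1.0, 1.0])

print(in_order, reordered)
```

The two orders give different totals because adding 1.0 to 1e8 is lost to float32 rounding, while adding it after the large terms cancel is not. A fixed reduction order removes that run-to-run variation.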

the benchmark run

$ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.348s
ds4: cuda backend initialized for graph diagnostics
ds4-bench: context buffers 1742.43 MiB (ctx=65665, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16418)
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,66.87,128,11.79,52184460
4096,2048,65.46,128,10.13,80373132
6144,2048,65.30,128,10.07,108561804
8192,2048,65.23,128,9.98,136750476
...
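
The CSV that ds4-bench prints is easy to post-process. As a small sketch using only the four rows shown above (the rest of the curve is elided in this thread), the snippet below computes the mean generation speed and the KV-cache growth per 2048-token step. The variable names are illustrative.

```python
import csv
import io

BENCH = """ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,66.87,128,11.79,52184460
4096,2048,65.46,128,10.13,80373132
6144,2048,65.30,128,10.07,108561804
8192,2048,65.23,128,9.98,136750476
"""

rows = list(csv.DictReader(io.StringIO(BENCH)))

# Mean generation throughput over the rows shown.
mean_gen_tps = sum(float(r["gen_tps"]) for r in rows) / len(rows)

# KV-cache growth between consecutive context sizes.
kv_growth = [
    int(b["kvcache_bytes"]) - int(a["kvcache_bytes"])
    for a, b in zip(rows, rows[1:])
]
step_tokens = int(rows[1]["ctx_tokens"]) - int(rows[0]["ctx_tokens"])
bytes_per_token = kv_growth[0] // step_tokens

print(round(mean_gen_tps, 2), kv_growth, bytes_per_token)
```

For these rows the KV cache grows by the same number of bytes at every step, which works out to 13764 bytes per context token.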

the eval run (until the first failure)

$ ./ds4-eval --nothink --temp 3 --min-p 0.25
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.314s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: model shape DeepSeek V4 Flash
ds4-eval: context buffers 718.95 MiB (ctx=16777, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=4196)
ds4-eval: 14/92 passed, 1 failed, runtime 00h:12m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201      462      663 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149       64      213 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      408      489 70       70       AIME2025/aime2025-01
  4 PASSED        313      170      483 C        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272      372      644 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146      788      934 468      468      AIME2025/aime2025-16
  7 PASSED        156      496      652 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127       55      182 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 PASSED        633     2537     3170 588      588      AIME2025/aime2025-02
 10 PASSED        182      382      564 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137       85      222 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165      612      777 16       16       AIME2025/aime2025-03
 13 PASSED        149      901     1050 A        A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
 14 PASSED        167       76      243 J        J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
 15 FAILED        305      880     1185 96       82       AIME2025/aime2025-18
 16 STOPPED       131       14      145 -        D        GPQA Diamond/recNFJjE5PPTqVJGv
 17 PENDING         0        0        0 -        I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
...

Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/reasoning_text); rejects non-text modalities (input_image/file/audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed.
- Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
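
To make the streaming lifecycle in the commit message above more concrete, here is a minimal sketch of a text-only stream where every frame carries a monotonically increasing sequence_number and the terminal event depends on the finish reason. It is not the server's implementation: `sse_frame`, `stream_text`, the finish-reason keys, and the trimmed payloads are assumptions, and real Responses API events carry more fields than shown.

```python
import json

# Terminal event per finish reason, following the commit message above.
# The keys "stop", "length", and "error" are hypothetical labels.
TERMINAL_EVENT = {
    "stop": "response.completed",
    "length": "response.incomplete",
    "error": "response.failed",
}

def sse_frame(event_type, seq, **fields):
    """Render one server-sent event with a JSON payload."""
    payload = {"type": event_type, "sequence_number": seq, **fields}
    return "event: %s\ndata: %s\n\n" % (event_type, json.dumps(payload))

def stream_text(deltas, finish_reason="stop"):
    """Build the frames for a single text-only assistant message."""
    plan = [
        ("response.created", {}),
        ("response.output_item.added", {}),
        ("response.content_part.added", {}),
    ]
    plan += [("response.output_text.delta", {"delta": d}) for d in deltas]
    plan += [
        ("response.output_text.done", {"text": "".join(deltas)}),
        ("response.content_part.done", {}),
        ("response.output_item.done", {}),
        (TERMINAL_EVENT[finish_reason], {}),
    ]
    # enumerate() supplies the per-event sequence_number.
    return [sse_frame(name, seq, **f) for seq, (name, f) in enumerate(plan)]

frames = stream_text(["Hel", "lo"], finish_reason="length")
print("".join(frames))
```

Building the whole plan first and numbering it in one place keeps the sequence numbers gap-free even when the terminal event changes.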

Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.

Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.

Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.

Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
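
The idea of validating a generated Name=number list, rather than sampled prose, can be sketched as follows. The fixture names, the regular expression, and the helper functions are hypothetical; the actual fixture and checker in ds4_test are not shown in this thread.

```python
import re

def parse_assignments(text):
    """Extract Name=number pairs from generated text."""
    return {name: int(num) for name, num in re.findall(r"([A-Za-z]+)=(\d+)", text)}

def check_retrieval(generated, expected):
    """Pass only if every expected pair was recovered and nothing extra appears."""
    return parse_assignments(generated) == expected

# Hypothetical fixture values.
expected = {"Renzo": 17, "Lucia": 42, "Agnese": 8}

good = "Renzo=17, Lucia=42, Agnese=8"
bad = "Renzo=17, Lucia=24, Agnese=8"

print(check_retrieval(good, expected), check_retrieval(bad, expected))
```

Comparing parsed pairs makes the check insensitive to separators and whitespace while still failing on any wrong or missing number.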

Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)

Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.

Return a 400 error with error type "context_exceeded" when prompt tokens exceed context size. The response includes both n_prompt_tokens and n_ctx fields so clients can determine exactly why the request failed and how far over the limit they went. Error response format: { "error": { "message": "Prompt tokens (N) exceeds context size (M)", "type": "context_exceeded", "n_prompt_tokens": N, "n_ctx": M } }
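
As a sketch of the error shape described in the commit message above, the snippet below builds the 400 response body when the prompt is longer than the context. The function name and the return convention (status code plus JSON string, or None when the prompt fits) are assumptions; only the field names and message format come from the commit message.

```python
import json

def context_exceeded_error(n_prompt_tokens, n_ctx):
    """Return (status, body) if the prompt does not fit, else None."""
    if n_prompt_tokens <= n_ctx:
        return None
    body = {
        "error": {
            "message": "Prompt tokens (%d) exceeds context size (%d)"
                       % (n_prompt_tokens, n_ctx),
            "type": "context_exceeded",
            "n_prompt_tokens": n_prompt_tokens,
            "n_ctx": n_ctx,
        }
    }
    return 400, json.dumps(body)

status, payload = context_exceeded_error(20000, 16777)
error = json.loads(payload)["error"]
# A client can compute how far over the limit the request was.
overshoot = error["n_prompt_tokens"] - error["n_ctx"]
print(status, error["type"], overshoot)
```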

dwarfstar is typoed to drawfstar

fix typo in readme

CPU support for Q4_K routed experts (fixes antirez#171)

harryhaaren · 2026-06-01T18:40:24Z

Tested this branch with a 7900xtx (RDNA3), output is correct, speed is slow (as a single 24GB GPU with 128GB DRAM on Zen2 threadripper CPU, this is expected): ds4: prefill: 3.61 t/s, generation: 1.63 t/s (with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf )

Running the tests will take a bit too long in general I think... but happy to spot-check a specific test if that helps.

alantsev · 2026-06-01T22:19:16Z

Thanks @harryhaaren , can you please run it with DS4_ROCM_NO_F16_PAIR_WARP_MATMUL=1 option? This is the only change I made in the kernel logic - this option will turn it off, so you will be running the original matmul_f16_pair_ordered_chunks_kernel kernel. Thanks again.

alantsev · 2026-06-01T22:39:21Z

btw, the whole codebase (i.e. the ds4_cuda kernels) implies that you have a unified memory architecture.

Metal execution
Add the PRO Q4 routed-expert Metal path and distributed runtime changes needed to run the full model across two hosts.

Sliced GGUF loading
Allow distributed engines to bind only their local layer range, with token embeddings and the output head required only on the sides that need them.

Split artifacts
Document the two-file PRO Q4 setup and add download_model.sh targets for the coordinator half, worker half, and combined split download.

(cherry picked from commit a782cfba894c6a44af11e1b5fc69ccfc000ab39d)

alantsev · 2026-06-05T14:08:32Z

rebased on top of the current main

- enable indexer
- use precise math functions on the expert selection path
- enable agent for the rocm build
model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s ds4: cuda backend initialized for graph diagnostics ds4-test: Tensor equivalence candidate route=auto ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s ds4: cuda backend initialized for graph diagnostics ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-15.0155 cand=-15.0155 abs=0 id=1 ref=19.9358 cand=19.9358 abs=0 id=2 ref=-55.9084 cand=-55.9084 abs=0 id=3 ref=17.8982 ca nd=17.8982 abs=0 id=4 ref=26.0747 cand=26.0747 abs=0 ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=-2.66161 cand=-2.66161 abs=0 id=1 ref=21.3162 cand=21.3162 abs=0 id=2 ref=-45.7824 cand=-45.7824 abs=0 id=3 ref=10.9651 cand=10.9651 abs=0 id=4 ref=25.8229 cand=25.8229 abs=0 ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-3.03494 cand=-3.03494 abs=0 id=1 ref=23.3849 cand=23.3849 abs=0 id=2 ref=-42.7991 cand=-42.7991 abs=0 id=3 ref=16.0927 cand=16.0927 abs=0 id=4 ref=18.5051 cand=18.5051 abs=0 ds4-test: Tensor equivalence 
long_memory_archive top1 ref=32111 cand=32111 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence long_memory_archive largest deltas: id=0 ref=-8.42831 cand=-8.42831 abs=0 id=1 ref=19.284 cand=19.284 abs=0 id=2 ref=-50.653 cand=-50.653 abs=0 id=3 ref=10.6968 cand= 10.6968 abs=0 id=4 ref=21.0302 cand=21.0302 abs=0 ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence long_code_audit largest deltas: id=0 ref=-4.50487 cand=-4.50487 abs=0 id=1 ref=19.7669 cand=19.7669 abs=0 id=2 ref=-47.0626 cand=-47.0626 abs=0 id=3 ref=16.7405 cand= 16.7405 abs=0 id=4 ref=23.0197 cand=23.0197 abs=0 ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=5/5 min_overlap=20/20 worst_rank_delta=0 worst_rms=0 worst_max_abs=0 worst_ top20_max_abs=0 metal-tensor-equivalence: OK server: server: OK ds4 tests: ok ``` the benchmark run ``` $ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.348s ds4: cuda backend initialized for graph diagnostics ds4-bench: context buffers 1742.43 MiB (ctx=65665, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16418) ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes 2048,2048,66.87,128,11.79,52184460 4096,2048,65.46,128,10.13,80373132 6144,2048,65.30,128,10.07,108561804 8192,2048,65.23,128,9.98,136750476 ... 
```

the eval run (until the first failure)

```
$ ./ds4-eval --nothink --temp 3 --min-p 0.25
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.314s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: model shape DeepSeek V4 Flash
ds4-eval: context buffers 718.95 MiB (ctx=16777, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=4196)

ds4-eval: 14/92 passed, 1 failed, runtime 00h:12m

 #  state    prompt   gen  total  given  correct  test
 1  PASSED      201   462    663      B        B  GPQA Diamond/recNu3MXkvWUzHZr9
 2  PASSED      149    64    213      C        C  SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
 3  PASSED       81   408    489     70       70  AIME2025/aime2025-01
 4  PASSED      313   170    483      C        C  GPQA Diamond/recoiTJPGUmzAkief
 5  PASSED      272   372    644      J        J  SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
 6  PASSED      146   788    934    468      468  AIME2025/aime2025-16
 7  PASSED      156   496    652      B        B  GPQA Diamond/rec4UqStf9WUVif1f
 8  PASSED      127    55    182      E        E  SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
 9  PASSED      633  2537   3170    588      588  AIME2025/aime2025-02
10  PASSED      182   382    564      B        B  GPQA Diamond/recgI6tUQ7RLJRWGx
11  PASSED      137    85    222      A        A  SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
12  PASSED      165   612    777     16       16  AIME2025/aime2025-03
13  PASSED      149   901   1050      A        A  GPQA Diamond (modified)/recDytVnNYZe2HuUU
14  PASSED      167    76    243      J        J  SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
15  FAILED      305   880   1185     96       82  AIME2025/aime2025-18
16  STOPPED     131    14    145      -        D  GPQA Diamond/recNFJjE5PPTqVJGv
17  PENDING       0     0      0      -        I  SuperGPQA/7ca71b86327744b78e93185a45bc5cef
...
```
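
For readers who want to interpret the `Tensor equivalence` lines in the test logs above, here is a minimal Python sketch of how fields such as `top1`, `top5_overlap`, `rms` and `max_abs` could be computed from a reference and a candidate logit vector. This is an illustration only: `compare_logits` is a hypothetical helper, and the exact definitions used by `ds4_test` are assumed here, not taken from its source.

```python
import math

def compare_logits(ref, cand, k=5):
    """Compare two logit vectors (hypothetical helper, not ds4 code).

    Assumed definitions: rms and max_abs are taken over the element-wise
    difference, and top-k overlap counts token ids shared by both top-k sets.
    """
    if len(ref) != len(cand) or not ref:
        raise ValueError("vectors must be non-empty and of equal length")

    diffs = [c - r for r, c in zip(ref, cand)]
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    max_abs = max(abs(d) for d in diffs)

    def top_ids(values):
        # Indices of the k largest logits, best first; ties broken by index.
        return sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]

    ref_top, cand_top = top_ids(ref), top_ids(cand)
    return {
        "top1_match": ref_top[0] == cand_top[0],
        "topk_overlap": len(set(ref_top) & set(cand_top)),
        "rms": rms,
        "max_abs": max_abs,
    }

if __name__ == "__main__":
    ref = [1.0, 5.0, 3.0, 2.0]
    cand = [1.0, 5.0, 2.0, 3.0]  # made-up values: ids 2 and 3 swap rank
    print(compare_logits(ref, cand, k=2))
```

Under these assumed definitions, identical vectors give `rms=0` and `max_abs=0`, which is the pattern reported for every case in the `DS4_CUDA_MOE_NO_ATOMIC_DOWN=1` run.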

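
The two test runs differ only in `DS4_CUDA_MOE_NO_ATOMIC_DOWN=1`: the default run reports nonzero deltas on the long prompts (for example `rms=0.582494` for `long_memory_archive`), while the run with the variable set reports `rms=0` everywhere. One plausible reading, inferred from the variable name and not confirmed by the logs, is that the default path accumulates MoE down-projection results with atomic adds, whose ordering is not fixed. The Python sketch below shows the underlying effect with made-up numbers: single-precision addition is not associative, so the same contributions summed in a different order can give a different result. `f32` and `sum_f32` are hypothetical helpers for this illustration.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate in float32, rounding after every addition, in the given order."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc

# Made-up contributions, not real expert outputs.
order_a = sum_f32([1e8, 1.0, -1e8])  # the 1.0 is absorbed by 1e8 before the cancel
order_b = sum_f32([1e8, -1e8, 1.0])  # the large terms cancel first, 1.0 survives

if __name__ == "__main__":
    print(order_a, order_b)
```

A fixed reduction order removes this source of run-to-run variation, which would be consistent with the bit-identical results in the second run.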
mitsuhiko and others added 30 commits May 11, 2026 12:30

feat(server): report KV cache usage

0ca2e28

feat(server): report Anthropic cache usage

38800bf

README: separate motivations.

c5ef7ac

Merge branch 'pr-91-responses' into responses-api

2174611

Tighten Responses tool_search replay

6396966

Fix Responses tool checkpoint cache reuse

a01bf1d

Fix Responses API live continuation

acb40bf

metal: cover q4 expert tensors in model views

2a7a5f3

Skip tool checkpoint canonicalization for exact DSML replay

b4c5f7c

Merge responses-api

e88a71e

Use visible live checkpoints for toolless thinking

5453ad0

Clarify server progress logs

646798f

Add Anthropic live tool continuation

43535e1

Revert "metal: cover q4 expert tensors in model views"

67e6146

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Tag Responses API server logs

0083475

Recover Responses replays without hidden reasoning

0610591

Stream Anthropic tool calls live

94c1f38

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

fix typo in readme

741d0cc

dwarfstar is typoed to drawfstar

Merge pull request antirez#155 from kernelzeroday/main

98593ec

fix typo in readme

Fix typos in README.md

f6fa52b

Merge branch 'pr-150-context-error' into merge-pr-150-standard-context

157873b

user and others added 2 commits May 30, 2026 12:51

cuda: speed up Q4 routed MoE

dc51d64

Merge pull request antirez#272 from hexxyan/feat/cpu-q4k-routed-experts

ba00a8a

CPU support for Q4_K routed experts (fixes antirez#171)

user and others added 5 commits June 2, 2026 09:07

cuda: keep Flash paths building after PRO Q4 API changes

f7511c2

cuda: warm up top-k regression timing

bce69b0

download: remove legacy model targets

690b659

Use Hugging Face CLI for PRO model downloads

477c0e8

alantsev mentioned this pull request Jun 4, 2026

Support for AMD GPU (ROCm/HIP) backend #16

Open

Salvatore Sanfilippo and others added 8 commits June 4, 2026 13:43

Implement SSD streaming

9ba160a

cuda: bind optional model cache to its fd

d960611

Fix Metal view cap for full model maps

25b20eb

AGENT.md updated.

297f750

Fix distributed KV snapshot request IDs

5cd0739

Merge branch 'streaming'

01be833

cuda: stub streaming expert cache hooks

c47b15f

README updated.

5b95fa1

alantsev force-pushed the rocm branch from 2d046ac to 958c8fb Compare June 5, 2026 14:01

antirez and others added 8 commits June 6, 2026 11:06

Harden ds4-agent DSML parsing

a56519d

Show greedy sampling in agent status

02852fa

Show prefill speed in agent progress

56b25eb

Show agent web tool status messages

7ac436a

Make ds4-agent interruption cooperative

42e5915

Style ds4-agent system status messages

c463029

merge from main@upstream

6b4e917

alantsev force-pushed the rocm branch from 958c8fb to 9565c07 Compare June 6, 2026 13:28


alantsev wants to merge 211 commits into antirez:rocm from alantsev:rocm

alantsev commented May 31, 2026 (edited)

harryhaaren commented Jun 1, 2026


alantsev commented Jun 1, 2026


alantsev commented Jun 1, 2026


alantsev commented Jun 5, 2026


16 participants


alantsev commented May 31, 2026 •
