[AMD] improve dsr1 fp4 disagg#1584
Conversation
> Co-authored-by: billishyahao <bill.he@amd.com> > Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
| - "DECODE_MTP_SIZE=3" | ||
|
|
||
| # 1*DEP8 + 1*DEP8 | ||
| - spec-decoding: "mtp" | ||
| conc-list: [ 64, 128 ] | ||
| prefill: | ||
| num-worker: 1 | ||
| tp: 8 | ||
| ep: 8 | ||
| dp-attn: true | ||
| additional-settings: | ||
| - "PREFILL_NODES=1" | ||
| decode: | ||
| num-worker: 1 | ||
| tp: 8 | ||
| ep: 8 | ||
| dp-attn: true | ||
| additional-settings: | ||
| - "DECODE_NODES=1" | ||
| - "DECODE_MTP_SIZE=3" |
There was a problem hiding this comment.
🔴 This PR introduces duplicate benchmark sweep points: after bumping the two pre-existing 1*DEP8 + 1*DEP8 blocks in the ISL=8192 search-space from DECODE_MTP_SIZE=1 to DECODE_MTP_SIZE=3 (lines 2052, 2071), and then adding a new block with conc-list [64, 128] and the same MTP=3 (lines 2073-2090), all three blocks now share byte-identical topology. conc=64 and conc=128 will each be benchmarked twice on the expensive mi355x-disagg multinode runner. Fix by either dropping 64 and 128 from the new block, or consolidating the three blocks into a single entry with conc-list: [64, 128, 256, 512].
Extended reasoning...
What the bug is
In .github/configs/amd-master.yaml, under dsr1-fp4-mi355x-sglang-disagg-mtp / scenarios.fixed-seq-len / isl: 8192, the post-PR YAML contains three 1*DEP8 + 1*DEP8 search-space entries that, after this PR's edits, have byte-identical prefill and decode configuration. The only thing that differs across them is the conc-list:
| Lines | conc-list | DECODE_MTP_SIZE pre-PR | DECODE_MTP_SIZE post-PR |
|---|---|---|---|
| ~2034-2052 | [128, 512] |
1 |
3 (bumped by PR) |
| ~2053-2071 | [64, 256] |
1 |
3 (bumped by PR) |
| ~2073-2090 | [64, 128] |
(new) | 3 (added by PR) |
All three entries are: spec-decoding: mtp, prefill num-worker=1 tp=8 ep=8 dp-attn=true PREFILL_NODES=1, decode num-worker=1 tp=8 ep=8 dp-attn=true DECODE_NODES=1 DECODE_MTP_SIZE=3.
Why existing code does not prevent it
The sweep generator (utils/matrix_logic/generate_sweep_configs.py, multinode branch) emits one matrix entry per search-space entry, passing the conc-list through unchanged. There is no cross-entry deduplication step: each (config, conc) pair from each entry becomes a real benchmark run. So overlapping conc values across two configurationally-identical entries produce two real, identical benchmark runs.
Step-by-step proof
- Reader sees the three entries above. All three have identical prefill+decode configuration after the PR.
- For the
[128, 512]entry the matrix generator will emit benchmark runs at conc=128 and conc=512. - For the
[64, 256]entry it will emit runs at conc=64 and conc=256. - For the new
[64, 128]entry it will emit runs at conc=64 and conc=128. - Therefore
conc=64is benchmarked twice (entries 2 and 3) andconc=128is benchmarked twice (entries 1 and 3), each with byte-identical sglang/topology settings.
Why this PR is responsible
Before this PR, the two pre-existing entries had DECODE_MTP_SIZE=1, so any new MTP=3 entry would not have duplicated them. This PR creates the duplication by doing both of:
- (a) Bumping the two existing entries'
DECODE_MTP_SIZEfrom1to3(lines 2052 and 2071 of the diff), homogenizing their topology with the new entry, and - (b) Adding a new entry with
conc-list: [64, 128]andDECODE_MTP_SIZE=3whose conc values overlap with the two pre-existing entries.
Either change alone would have been fine; the combination produces the duplication.
Impact
This runner is mi355x-disagg with multinode: true. Each benchmark run uses multiple multinode-disagg-grade nodes for isl=8192 osl=1024, so two duplicate sweep points are not free CI-time — they are real GPU-hours on a constrained multinode runner, producing no new data.
How to fix
One-line edits, pick either:
- Drop
64and128from the new block'sconc-list. (But after dropping them, the new block has no novel conc values — implying the entry should simply be removed.) - Or consolidate the three blocks into a single entry with
conc-list: [64, 128, 256, 512], which is cleaner and matches the apparent intent of running the full set at MTP=3.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26619796352 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26666404342 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26619796352 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26666404342 |
f425a6e to
bfb242c
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26668569482 |
1 similar comment
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26668569482 |
Improve dsr1 fp4 disagg
Note
Low Risk
Benchmark matrix and changelog only; no application runtime or auth changes.
Overview
Updates DeepSeek-R1 FP4 MI355X SGLang disaggregated MTP benchmark coverage in
.github/configs/amd-master.yamland records it inperf-changelog.yaml.The
dsr1-fp4-mi355x-sglang-disagg-mtpcontainer moves fromlmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519tov0.5.12.post1-rocm720-mi35x-20260529. Several MTP disagg rows now setDECODE_MTP_SIZE=3(was 1 or 2) on 1P1D and 1P2D decode paths.For 8k1k, a new 1×DEP8 + 1×DEP8 MTP scenario sweeps concurrency
[64, 128](with existing points adjusted toward 128/256-style coverage per the changelog). Other 8k1k DEP8 MTP entries also alignDECODE_MTP_SIZEto 3 where they were 1.Reviewed by Cursor Bugbot for commit bfb242c. Bugbot is set up for automated code reviews on this repo. Configure here.