51 changes: 45 additions & 6 deletions .github/configs/amd-master.yaml
@@ -1862,7 +1862,7 @@ dsr1-fp4-mi355x-sglang-disagg:
- "DECODE_MTP_SIZE=0"

dsr1-fp4-mi355x-sglang-disagg-mtp:
- image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519
+ image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260529
model: amd/DeepSeek-R1-0528-MXFP4-v2
model-prefix: dsr1
runner: mi355x-disagg
@@ -2030,11 +2030,11 @@ dsr1-fp4-mi355x-sglang-disagg-mtp:
dp-attn: false
additional-settings:
- "DECODE_NODES=2"
- - "DECODE_MTP_SIZE=2"
+ - "DECODE_MTP_SIZE=3"

# 1*DEP8 + 1*DEP8
- spec-decoding: "mtp"
- conc-list: [ 128, 512 ]
+ conc-list: [ 384, 512 ]
prefill:
num-worker: 1
tp: 8
@@ -2049,11 +2049,11 @@ dsr1-fp4-mi355x-sglang-disagg-mtp:
dp-attn: true
additional-settings:
- "DECODE_NODES=1"
- - "DECODE_MTP_SIZE=1"
+ - "DECODE_MTP_SIZE=3"

# 1*DEP8 + 1*DEP8
- spec-decoding: "mtp"
- conc-list: [ 64, 256 ]
+ conc-list: [ 192, 256 ]
prefill:
num-worker: 1
tp: 8
@@ -2068,7 +2068,46 @@ dsr1-fp4-mi355x-sglang-disagg-mtp:
dp-attn: true
additional-settings:
- "DECODE_NODES=1"
- - "DECODE_MTP_SIZE=1"
+ - "DECODE_MTP_SIZE=3"
+
+
+ # 1*DEP8 + 1*DEP8
+ - spec-decoding: "mtp"
+ conc-list: [ 96, 128 ]
+ prefill:
+ num-worker: 1
+ tp: 8
+ ep: 8
+ dp-attn: true
+ additional-settings:
+ - "PREFILL_NODES=1"
+ decode:
+ num-worker: 1
+ tp: 8
+ ep: 8
+ dp-attn: true
+ additional-settings:
+ - "DECODE_NODES=1"
+ - "DECODE_MTP_SIZE=3"
+
+ # 1*DEP8 + 1*DEP8
+ - spec-decoding: "mtp"
+ conc-list: [ 48, 64 ]
+ prefill:
+ num-worker: 1
+ tp: 8
+ ep: 8
+ dp-attn: true
+ additional-settings:
+ - "PREFILL_NODES=1"
+ decode:
+ num-worker: 1
+ tp: 8
+ ep: 8
+ dp-attn: true
+ additional-settings:
+ - "DECODE_NODES=1"
+ - "DECODE_MTP_SIZE=3"
Comment on lines +2091 to +2110
Contributor

🔴 This PR introduces duplicate benchmark sweep points: after bumping the two pre-existing 1*DEP8 + 1*DEP8 blocks in the ISL=8192 search-space from DECODE_MTP_SIZE=1 to DECODE_MTP_SIZE=3 (lines 2052, 2071), and then adding a new block with conc-list [64, 128] and the same MTP=3 (lines 2073-2090), all three blocks now share byte-identical topology. conc=64 and conc=128 will each be benchmarked twice on the expensive mi355x-disagg multinode runner. Fix by either dropping 64 and 128 from the new block, or consolidating the three blocks into a single entry with conc-list: [64, 128, 256, 512].

Extended reasoning...

What the bug is

In .github/configs/amd-master.yaml, under dsr1-fp4-mi355x-sglang-disagg-mtp / scenarios.fixed-seq-len / isl: 8192, the post-PR YAML contains three 1*DEP8 + 1*DEP8 search-space entries that, after this PR's edits, have byte-identical prefill and decode configuration. The only thing that differs across them is the conc-list:

| Lines | conc-list | DECODE_MTP_SIZE pre-PR | DECODE_MTP_SIZE post-PR |
| --- | --- | --- | --- |
| ~2034-2052 | [128, 512] | 1 | 3 (bumped by PR) |
| ~2053-2071 | [64, 256] | 1 | 3 (bumped by PR) |
| ~2073-2090 | [64, 128] | (new) | 3 (added by PR) |

All three entries are: spec-decoding: mtp, prefill num-worker=1 tp=8 ep=8 dp-attn=true PREFILL_NODES=1, decode num-worker=1 tp=8 ep=8 dp-attn=true DECODE_NODES=1 DECODE_MTP_SIZE=3.

Why existing code does not prevent it

The sweep generator (utils/matrix_logic/generate_sweep_configs.py, multinode branch) emits one matrix entry per search-space entry, passing the conc-list through unchanged. There is no cross-entry deduplication step: each (config, conc) pair from each entry becomes a real benchmark run. So overlapping conc values across two configurationally-identical entries produce two real, identical benchmark runs.

Step-by-step proof

  1. Reader sees the three entries above. All three have identical prefill+decode configuration after the PR.
  2. For the [128, 512] entry the matrix generator will emit benchmark runs at conc=128 and conc=512.
  3. For the [64, 256] entry it will emit runs at conc=64 and conc=256.
  4. For the new [64, 128] entry it will emit runs at conc=64 and conc=128.
  5. Therefore conc=64 is benchmarked twice (entries 2 and 3) and conc=128 is benchmarked twice (entries 1 and 3), each with byte-identical sglang/topology settings.
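
The steps above can be reproduced with a small standalone check. This is only a sketch: it does not call the real generate_sweep_configs.py, and the names `ENTRIES`, `freeze`, and `find_duplicate_runs` are hypothetical. It assumes, as described above, that every (config, conc) pair from every entry becomes one benchmark run, and it models the three entries as plain dicts.

```python
from collections import defaultdict

# Hypothetical in-memory copy of the three 1*DEP8 + 1*DEP8 entries.
# Everything except conc-list is identical after the PR.
TOPOLOGY = {
    "spec-decoding": "mtp",
    "prefill": {"num-worker": 1, "tp": 8, "ep": 8, "dp-attn": True,
                "additional-settings": ["PREFILL_NODES=1"]},
    "decode": {"num-worker": 1, "tp": 8, "ep": 8, "dp-attn": True,
               "additional-settings": ["DECODE_NODES=1", "DECODE_MTP_SIZE=3"]},
}
ENTRIES = [
    {**TOPOLOGY, "conc-list": [128, 512]},
    {**TOPOLOGY, "conc-list": [64, 256]},
    {**TOPOLOGY, "conc-list": [64, 128]},
]


def freeze(value):
    """Recursively turn dicts and lists into hashable tuples."""
    if isinstance(value, dict):
        return tuple(sorted((k, freeze(v)) for k, v in value.items()))
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value


def find_duplicate_runs(entries):
    """Map each conc value that is emitted more than once for an
    identical config to the indices of the entries that emit it."""
    seen = defaultdict(list)
    for index, entry in enumerate(entries):
        config = freeze({k: v for k, v in entry.items() if k != "conc-list"})
        for conc in entry["conc-list"]:
            seen[(config, conc)].append(index)
    return {conc: idxs for (_, conc), idxs in seen.items() if len(idxs) > 1}


duplicates = find_duplicate_runs(ENTRIES)
print(sorted(duplicates.items()))  # [(64, [1, 2]), (128, [0, 2])]
```

Indices are zero-based, so `[1, 2]` corresponds to entries 2 and 3 in the list above, and `[0, 2]` to entries 1 and 3.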

Why this PR is responsible

Before this PR, the two pre-existing entries had DECODE_MTP_SIZE=1, so any new MTP=3 entry would not have duplicated them. This PR creates the duplication by doing both of:

  • (a) Bumping the two existing entries' DECODE_MTP_SIZE from 1 to 3 (lines 2052 and 2071 of the diff), homogenizing their topology with the new entry, and
  • (b) Adding a new entry with conc-list: [64, 128] and DECODE_MTP_SIZE=3 whose conc values overlap with the two pre-existing entries.

Either change alone would have been fine; the combination produces the duplication.

Impact

This runner is mi355x-disagg with multinode: true. Each benchmark run occupies multiple nodes for isl=8192 osl=1024, so the two duplicate sweep points are not free CI time: they consume real GPU-hours on a constrained multinode runner and produce no new data.

How to fix

One-line edits, pick either:

  • Drop 64 and 128 from the new block's conc-list. (After dropping them, the new block has no conc values left, so the entry should simply be removed.)
  • Or consolidate the three blocks into a single entry with conc-list: [64, 128, 256, 512], which is cleaner and matches the apparent intent of running the full set at MTP=3.
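
The consolidation option can be sketched as a merge over entries whose configuration is identical. This is an illustration, not part of the repository: `merge_identical_entries` and the sample `entries` list are hypothetical, and the entries are modelled as plain dicts rather than loaded from amd-master.yaml.

```python
import json


def merge_identical_entries(entries):
    """Union the conc-lists of entries that are otherwise identical."""
    merged = {}
    for entry in entries:
        config = {k: v for k, v in entry.items() if k != "conc-list"}
        # A canonical JSON dump serves as a hashable key for the config.
        key = json.dumps(config, sort_keys=True)
        slot = merged.setdefault(key, {"config": config, "concs": set()})
        slot["concs"].update(entry["conc-list"])
    return [
        {**slot["config"], "conc-list": sorted(slot["concs"])}
        for slot in merged.values()
    ]


# Hypothetical stand-in for the three entries; only conc-list differs.
decode = {"num-worker": 1, "tp": 8, "ep": 8, "dp-attn": True,
          "additional-settings": ["DECODE_NODES=1", "DECODE_MTP_SIZE=3"]}
entries = [
    {"spec-decoding": "mtp", "decode": decode, "conc-list": [128, 512]},
    {"spec-decoding": "mtp", "decode": decode, "conc-list": [64, 256]},
    {"spec-decoding": "mtp", "decode": decode, "conc-list": [64, 128]},
]

consolidated = merge_identical_entries(entries)
print(len(consolidated), consolidated[0]["conc-list"])  # 1 [64, 128, 256, 512]
```

Using a set for the conc values is what removes the repeated 64 and 128; entries with a different topology would hash to a different key and stay separate.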


# 2*DEP8 + 1*DEP8
- spec-decoding: "mtp"
3 changes: 2 additions & 1 deletion benchmarks/multi_node/amd_utils/env.sh
@@ -126,7 +126,8 @@ else
export SGLANG_USE_AITER=1

export SGLANG_MORI_DISPATCH_DTYPE=auto
- export SGLANG_MORI_FP8_COMB=true
+ export MORI_COMBINE_DTYPE_PREFILL=fp8_direct_cast
+ export MORI_COMBINE_DTYPE_DECODE=fp8
export SGLANG_MORI_QP_PER_TRANSFER=4
export SGLANG_MORI_NUM_WORKERS=4
export MORI_IO_SQ_BACKOFF_TIMEOUT_US=50000
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -3201,6 +3201,13 @@
- "MoRI conn.py overlay (48e459bd) via job.slurm; launcher qwen3.5_fp4_mi355x_sglang-disagg.sh"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1579

+ - config-keys:
+ - dsr1-fp4-mi355x-sglang-disagg-mtp
+ description:
+ - "Bump the image to May 26"
+ - "Add conc 128/256 new sweep point"
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1584

- config-keys:
- glm5-fp8-gb300-dynamo-sglang
description: