glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology#1583
Conversation
…pology Mirrors NVIDIA/srt-slurm#175: adds a 5th 8k1k_stp_lowlat_4 recipe with decode_nodes/workers=17, and lowers per-zip-index decode max-running-requests / cuda-graph-max-bs from a flat 4096 to 128/64/32/16/1 across lowlat_0..4. Benchmark concurrencies follow suit: 128/64/32/16/12. nvidia-master.yaml conc-list updated to match for each of the five 1p{3,5,9,15,17}d entries.
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
There was a problem hiding this comment.
LGTM — mechanical benchmark recipe / config update mirroring NVIDIA/srt-slurm#175; only finding is a trivial pr-link: TBD placeholder flagged inline.
Extended reasoning...
Overview
This PR is benchmark-only: it mirrors upstream NVIDIA/srt-slurm#175 by (1) adding a 5th low-latency topology (8k1k_stp_lowlat_4.yaml, 1p17d / 18 nodes), (2) lowering decode max-running-requests / cuda-graph-max-bs / benchmark.concurrencies per zip-index across lowlat_0..4 from a uniform 4096/1024 to 128/64/32/16/1 and 128/64/32/16/12, and (3) updating nvidia-master.yaml conc-list values plus appending a new entry for the 1p17d topology. perf-changelog.yaml gets a single entry documenting the update.
Security risks
None. No application/runtime code, no auth, no permissions, no data path, no dependency or container-image changes. Slurm recipe and benchmark-sweep tuning only.
Level of scrutiny
Low. The four edited recipe files are single-number tweaks; the new lowlat_4.yaml is a near-verbatim sibling of lowlat_0..3 (same base: block, same prefill config, only the topology-specific decode_nodes / decode_workers / decode override differ). The five conc-list edits in nvidia-master.yaml line up 1:1 with the per-recipe benchmark.concurrencies values, and the new 1p17d entry follows the existing search-space schema. Each value can be eyeballed against the upstream PR.
Other factors
The one finding — pr-link: TBD in the new changelog entry — is a self-referential metadata placeholder with zero runtime impact, posted inline. Nothing else stood out: conc-list matches recipe concurrency for every topology; num-worker matches decode_workers; the new recipe's decode override is internally consistent with the upstream zip-index pattern.
| - glm5-fp4-gb300-dynamo-sglang | ||
| description: | ||
| - "Update GB300 FP4 GLM-5 8k1k low-latency sweep to mirror NVIDIA/srt-slurm#175: add a 5th 1p17d topology (decode_nodes/workers=17), and lower decode max-running-requests / cuda-graph-max-bs / benchmark concurrency per-zip-index from a flat 4096/1024 to 128/64/32/16/1 (mrr & cuda-graph) and 128/64/32/16/12 (concurrency)" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1583 |
There was a problem hiding this comment.
🟡 perf-changelog.yaml line 3208 has pr-link: TBD instead of https://github.com/SemiAnalysisAI/InferenceX/pull/1583. Every prior entry in the file uses the canonical https://github.com/SemiAnalysisAI/InferenceX/pull/<N> form, so the placeholder breaks the changelog's link-back convention and any URL-scraping automation. Trivial swap — just substitute the PR number.
Extended reasoning...
What is wrong
The new perf-changelog entry added by this PR (lines 3204–3208 of perf-changelog.yaml at HEAD) ends with:
pr-link: TBDrather than the canonical form used by every other entry in the file, e.g. the immediately-preceding entry on line 3202:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1579Since this is PR #1583, the value should be https://github.com/SemiAnalysisAI/InferenceX/pull/1583.
Addressing the refutation
One verifier objected that the <pr-diff> block embedded in the review prompt already shows the substituted URL, so there is nothing to fix. I checked the committed state directly:
$ git log -1 --format='%H %s'
0b73f198e48b1e8fdae5acbe8373bc14971283bd glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology
$ git show HEAD -- perf-changelog.yaml | tail -8
+- config-keys:
+ - glm5-fp4-gb300-dynamo-sglang
+ description:
+ - "Update GB300 FP4 GLM-5 8k1k low-latency sweep ..."
+ pr-link: TBD
+
HEAD (0b73f19) is the PR commit itself, and git show HEAD — i.e. the actual diff this commit introduces — adds pr-link: TBD. The URL in the prompt's <pr-diff> block does not match the committed content; the file as it would be merged has TBD. Three independent verifiers reached the same conclusion by reading the committed file directly. Per the standing guidance to "trust what you observe now" when state conflicts, the committed file is authoritative here.
Step-by-step proof
- Open
perf-changelog.yamlat line 3208 in the working tree. - Observe the literal string
pr-link: TBD. - Run
git grep -n "pr-link: TBD" perf-changelog.yaml— exactly one hit, line 3208. - Run
git grep -nE "pr-link: https://github.com" perf-changelog.yaml | wc -l— ~30 hits, every other entry uses the URL form. - The PR number is 1583 (from PR metadata), so the intended value is
https://github.com/SemiAnalysisAI/InferenceX/pull/1583.
Impact
No runtime, benchmark, or correctness impact — this is a changelog-metadata field. Consequences:
- Breaks the file's convention that every entry links back to its PR.
- Any tooling that scrapes
pr-link:values to build a changelog page, link-check, or cross-reference releases will either skip this entry or emit aTBDplaceholder. - Future readers of the changelog cannot click through to find the originating PR.
Fix
Replace line 3208:
- pr-link: TBD
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1583|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26605445495 |
1 similar comment
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26605445495 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26666839643 |
Summary
Mirrors NVIDIA/srt-slurm#175 (zip_override_8k1k_lowlat update for
recipes/gb300-fp4/glm5.yaml):8k1k_stp_lowlat_4.yamlwithdecode_nodes/decode_workers = 17(1p17d, 18 nodes total).max-running-requestsandcuda-graph-max-bsfrom a flat4096to128 / 64 / 32 / 16 / 1acrosslowlat_0..4.benchmark.concurrenciesfrom"1024"to"128" / "64" / "32" / "16" / "12"to match.nvidia-master.yamlin sync: eachglm5-fp4-gb300-dynamo-sglang8k1k lowlat entry'sconc-listnow matches its recipe; a newconc-list: [12]/decode.num-worker: 17entry is appended for the 1p17d topology.Note
Low Risk
Benchmark and Slurm recipe tuning only; no runtime application or auth changes.
Overview
Aligns GB300 FP4 GLM-5 8k1k disaggregated low-latency STP coverage with NVIDIA/srt-slurm#175: per-topology concurrency and decode batch limits instead of a flat 1024/4096 sweep.
For the four existing 1p3d → 1p15d recipes (
8k1k_stp_lowlat_0–3),conc-listinnvidia-master.yamldrops from 1024 to 128 / 64 / 32 / 16, and each recipe’s decodemax-running-requests,cuda-graph-max-bs, andbenchmark.concurrenciesare lowered to the same values (from 4096/1024).A fifth topology (1p17d, 18 nodes) is added via new
8k1k_stp_lowlat_4.yaml(17 decode workers) with concurrency 12 and decode limits 1, plus a matching master search-space entry pointing at8k1k_stp_lowlat_4.yaml.perf-changelog.yamldocuments the change underglm5-fp4-gb300-dynamo-sglang.Reviewed by Cursor Bugbot for commit cdafe29. Bugbot is set up for automated code reviews on this repo. Configure here.