Skip to content

glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology#1583

Merged
functionstackx merged 3 commits into
mainfrom
glm5-gb300-lowlat-update
May 29, 2026
Merged

glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology#1583
functionstackx merged 3 commits into
mainfrom
glm5-gb300-lowlat-update

Conversation

@Ankur-singh
Copy link
Copy Markdown
Collaborator

@Ankur-singh Ankur-singh commented May 28, 2026

Summary

Mirrors NVIDIA/srt-slurm#175 (zip_override_8k1k_lowlat update for recipes/gb300-fp4/glm5.yaml):

  • Adds a 5th low-lat topology 8k1k_stp_lowlat_4.yaml with decode_nodes / decode_workers = 17 (1p17d, 18 nodes total).
  • Lowers per-zip-index decode max-running-requests and cuda-graph-max-bs from a flat 4096 to 128 / 64 / 32 / 16 / 1 across lowlat_0..4.
  • Drops the per-recipe benchmark.concurrencies from "1024" to "128" / "64" / "32" / "16" / "12" to match.
  • Keeps nvidia-master.yaml in sync: each glm5-fp4-gb300-dynamo-sglang 8k1k lowlat entry's conc-list now matches its recipe; a new conc-list: [12] / decode.num-worker: 17 entry is appended for the 1p17d topology.

Note

Low Risk
Benchmark and Slurm recipe tuning only; no runtime application or auth changes.

Overview
Aligns GB300 FP4 GLM-5 8k1k disaggregated low-latency STP coverage with NVIDIA/srt-slurm#175: per-topology concurrency and decode batch limits instead of a flat 1024/4096 sweep.

For the four existing 1p3d → 1p15d recipes (8k1k_stp_lowlat_03), conc-list in nvidia-master.yaml drops from 1024 to 128 / 64 / 32 / 16, and each recipe’s decode max-running-requests, cuda-graph-max-bs, and benchmark.concurrencies are lowered to the same values (from 4096/1024).

A fifth topology (1p17d, 18 nodes) is added via new 8k1k_stp_lowlat_4.yaml (17 decode workers) with concurrency 12 and decode limits 1, plus a matching master search-space entry pointing at 8k1k_stp_lowlat_4.yaml. perf-changelog.yaml documents the change under glm5-fp4-gb300-dynamo-sglang.

Reviewed by Cursor Bugbot for commit cdafe29. Bugbot is set up for automated code reviews on this repo. Configure here.

…pology

Mirrors NVIDIA/srt-slurm#175: adds a 5th 8k1k_stp_lowlat_4 recipe with
decode_nodes/workers=17, and lowers per-zip-index decode
max-running-requests / cuda-graph-max-bs from a flat 4096 to
128/64/32/16/1 across lowlat_0..4. Benchmark concurrencies follow suit:
128/64/32/16/12. nvidia-master.yaml conc-list updated to match for each
of the five 1p{3,5,9,15,17}d entries.
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — mechanical benchmark recipe / config update mirroring NVIDIA/srt-slurm#175; only finding is a trivial pr-link: TBD placeholder flagged inline.

Extended reasoning...

Overview

This PR is benchmark-only: it mirrors upstream NVIDIA/srt-slurm#175 by (1) adding a 5th low-latency topology (8k1k_stp_lowlat_4.yaml, 1p17d / 18 nodes), (2) lowering decode max-running-requests / cuda-graph-max-bs / benchmark.concurrencies per zip-index across lowlat_0..4 from a uniform 4096/1024 to 128/64/32/16/1 and 128/64/32/16/12, and (3) updating nvidia-master.yaml conc-list values plus appending a new entry for the 1p17d topology. perf-changelog.yaml gets a single entry documenting the update.

Security risks

None. No application/runtime code, no auth, no permissions, no data path, no dependency or container-image changes. Slurm recipe and benchmark-sweep tuning only.

Level of scrutiny

Low. The four edited recipe files are single-number tweaks; the new lowlat_4.yaml is a near-verbatim sibling of lowlat_0..3 (same base: block, same prefill config, only the topology-specific decode_nodes / decode_workers / decode override differ). The five conc-list edits in nvidia-master.yaml line up 1:1 with the per-recipe benchmark.concurrencies values, and the new 1p17d entry follows the existing search-space schema. Each value can be eyeballed against the upstream PR.

Other factors

The one finding — pr-link: TBD in the new changelog entry — is a self-referential metadata placeholder with zero runtime impact, posted inline. Nothing else stood out: conc-list matches recipe concurrency for every topology; num-worker matches decode_workers; the new recipe's decode override is internally consistent with the upstream zip-index pattern.

Comment thread perf-changelog.yaml
- glm5-fp4-gb300-dynamo-sglang
description:
- "Update GB300 FP4 GLM-5 8k1k low-latency sweep to mirror NVIDIA/srt-slurm#175: add a 5th 1p17d topology (decode_nodes/workers=17), and lower decode max-running-requests / cuda-graph-max-bs / benchmark concurrency per-zip-index from a flat 4096/1024 to 128/64/32/16/1 (mrr & cuda-graph) and 128/64/32/16/12 (concurrency)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1583
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 perf-changelog.yaml line 3208 has pr-link: TBD instead of https://github.com/SemiAnalysisAI/InferenceX/pull/1583. Every prior entry in the file uses the canonical https://github.com/SemiAnalysisAI/InferenceX/pull/<N> form, so the placeholder breaks the changelog's link-back convention and any URL-scraping automation. Trivial swap — just substitute the PR number.

Extended reasoning...

What is wrong

The new perf-changelog entry added by this PR (lines 3204–3208 of perf-changelog.yaml at HEAD) ends with:

  pr-link: TBD

rather than the canonical form used by every other entry in the file, e.g. the immediately-preceding entry on line 3202:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1579

Since this is PR #1583, the value should be https://github.com/SemiAnalysisAI/InferenceX/pull/1583.

Addressing the refutation

One verifier objected that the <pr-diff> block embedded in the review prompt already shows the substituted URL, so there is nothing to fix. I checked the committed state directly:

$ git log -1 --format='%H %s'
0b73f198e48b1e8fdae5acbe8373bc14971283bd glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology

$ git show HEAD -- perf-changelog.yaml | tail -8
+- config-keys:
+    - glm5-fp4-gb300-dynamo-sglang
+  description:
+    - "Update GB300 FP4 GLM-5 8k1k low-latency sweep ..."
+  pr-link: TBD
+

HEAD (0b73f19) is the PR commit itself, and git show HEAD — i.e. the actual diff this commit introduces — adds pr-link: TBD. The URL in the prompt's <pr-diff> block does not match the committed content; the file as it would be merged has TBD. Three independent verifiers reached the same conclusion by reading the committed file directly. Per the standing guidance to "trust what you observe now" when state conflicts, the committed file is authoritative here.

Step-by-step proof

  1. Open perf-changelog.yaml at line 3208 in the working tree.
  2. Observe the literal string pr-link: TBD.
  3. Run git grep -n "pr-link: TBD" perf-changelog.yaml — exactly one hit, line 3208.
  4. Run git grep -nE "pr-link: https://github.com" perf-changelog.yaml | wc -l — ~30 hits, every other entry uses the URL form.
  5. The PR number is 1583 (from PR metadata), so the intended value is https://github.com/SemiAnalysisAI/InferenceX/pull/1583.

Impact

No runtime, benchmark, or correctness impact — this is a changelog-metadata field. Consequences:

  • Breaks the file's convention that every entry links back to its PR.
  • Any tooling that scrapes pr-link: values to build a changelog page, link-check, or cross-reference releases will either skip this entry or emit a TBD placeholder.
  • Future readers of the changelog cannot click through to find the originating PR.

Fix

Replace line 3208:

-  pr-link: TBD
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1583

@github-actions
Copy link
Copy Markdown
Contributor

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

@functionstackx
Copy link
Copy Markdown
Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit e0cd8f7 into main May 29, 2026
5 of 6 checks passed
@functionstackx functionstackx deleted the glm5-gb300-lowlat-update branch May 29, 2026 23:10
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

2 participants