[CUDA] Add SM120 NVF4 block-scale MMA support by qqq-tao · Pull Request #2364 · tile-ai/tilelang

qqq-tao · 2026-06-09T10:27:58Z

Summary

This PR adds SM120 NVFP4 block-scaled MMA support and exposes it through the TileLang tile-op path so NVFP4 GEMM kernels can be written and validated directly in TileLang.

The main user-facing entry point remains the lightweight tile API, T.mma_gemm_blockscaled(...). The SM120-specific logic lives in the CUDA GEMM lowering / SM120 helper layer rather than being exposed as a new scheduling system.

This is intended as a follow-up to the earlier SM120 NVFP4 work. The scope is:

- add a clean SM120 warp-level mma.sync block-scaled path,
- provide a TileLang GEMM example and a comparison harness against CUTLASS GeForce example 79a,
- validate fragment layout, scale mapping, codegen, constant-scale correctness, and varying-scale correctness,
- document the current performance status and the remaining optimization path.

It does not claim that the current TileLang public example has reached CUTLASS-level performance yet.

What changed

- Added SM120 warp-level NVFP4 block-scaled MMA codegen for:
  - m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X
  - E2M1 A/B operands
  - UE4M3 scale factors
  - FP32 accumulation
- Added T.mma_gemm_blockscaled(...) as the TileLang-facing tile-op wrapper for the SM120 mma.sync block-scaled path.
- Added block-scaled MMA fragment layout and scale-factor lane mapping coverage against the CUTLASS/CuTe SM120 layout contract.
- Added deterministic varying-scale correctness coverage. The test packs UE4M3 scale bytes (one per 16 elements along K) into TileLang's uint32 scale-word format and checks against a scale-aware FP32 reference.
- Added the required CUDA helper in src/tl_templates/cuda/gemm_sm120.h, so the SM120 MMA instruction is no longer only represented in Python-side lowering logic.
- Added examples/gemm_sm120/nvfp4_gemm_compare.py, which can:
  - run the TileLang NVFP4 GEMM example,
  - verify small-shape correctness,
  - optionally build/run CUTLASS GeForce example 79a for comparison,
  - control warmup/repeat/backend settings for basic performance checks.
- Fixed shared-memory allocation byte accounting for packed scalar float4_e2m1fn, so packed FP4 shared buffers reserve two logical FP4 values per byte.
- Tightened the shared-memory merge offset handling for packed scalar FP4 in the non-alias rewrite path and added a regression test for that offset unit conversion.
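
For readers unfamiliar with the operand format, the sketch below decodes packed E2M1 values in plain Python. The eight magnitudes are the standard E2M1 value set; the function names and the nibble order (low nibble first) are assumptions made for illustration and are not taken from this PR's test utilities.

```python
# Magnitudes of the eight non-negative E2M1 codes (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 select the magnitude."""
    if not 0 <= nibble <= 0xF:
        raise ValueError(f"not a 4-bit code: {nibble}")
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_fp4_bytes(data: bytes) -> list[float]:
    """Decode packed FP4 bytes, two logical values per byte.

    Assumption (hypothetical, for illustration): the low nibble holds the
    earlier element and the high nibble holds the later one.
    """
    values = []
    for byte in data:
        values.append(decode_e2m1(byte & 0xF))
        values.append(decode_e2m1(byte >> 4))
    return values


if __name__ == "__main__":
    # 0x21 holds code 1 (0.5) in the low nibble and code 2 (1.0) in the high nibble.
    print(unpack_fp4_bytes(bytes([0x21, 0xF0])))
```

A lookup table keeps the decode exact, which matters when the result feeds a zero-tolerance comparison.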

Validation

Focused tests:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6 \
TILELANG_DISABLE_CACHE=1 \
python -m pytest \
  testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py \
  testing/python/transform/test_tilelang_transform_merge_shared_memory_allocations.py \
  -q

Result:

17 passed

Example correctness smoke:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6 \
TILELANG_DISABLE_CACHE=1 \
python examples/gemm_sm120/nvfp4_gemm_compare.py \
  --m 512 --n 512 --k 512 \
  --block-m 128 --block-n 128 --block-k 256 \
  --num-stages 2 --out-dtype bfloat16 \
  --verify --warmup-ms 1 --rep-ms 1 \
  --n-warmup 1 --n-repeat 1 --backend event

Result on the SM120 dev machine:

TileLang correctness: passed
TileLang latency: 0.0246 ms
TileLang FLOPS: 10.91 TFLOPS
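
The reported throughput is consistent with the usual 2·M·N·K operation count for a GEMM. A minimal check, assuming that convention (the helper name is illustrative and not part of the harness):

```python
def gemm_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Throughput in TFLOPS, counting one multiply and one add per (m, n, k) triple."""
    flops = 2 * m * n * k
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


if __name__ == "__main__":
    # 512^3 problem at the reported 0.0246 ms latency.
    print(f"{gemm_tflops(512, 512, 512, 0.0246):.2f} TFLOPS")  # prints "10.91 TFLOPS"
```

At this problem size and with a single timed repeat, the figure mostly reflects launch overhead and timer resolution, so it is best read as a smoke result rather than a benchmark.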

Repository checks:

pre-commit run --all-files

Result:

Passed

Performance status

The PR includes a benchmark harness, but the current public TileLang example should be treated as a correctness-first SM120 NVFP4 baseline.

During local RTX 5090 / SM120a investigation, I also tested larger 8192^3 configurations and several private optimization variants around:

- 128x128x256, stages=2,
- packed FP4 shared-memory footprint,
- TMA producer paths for operands / scale factors,
- warp-specialized producer-consumer scaffolding,
- alternative block-scaled operand / scale staging orders.

Those experiments showed that TileLang can get into the same broad performance range as a usable SM120 NVFP4 GEMM path, but the remaining gap to CUTLASS 79a is not solved by a small source-order tweak. The stronger finding is that TileLang still needs a cleaner SM120 block-scaled operand/scale fragment lifecycle in the GEMM lowering. In particular, the current public path does not yet implement a first-class A/B/SFA/SFB operand package or a consumer pipeline that overlaps next-fragment setup with current MMA issue.

So the performance-specific work is intentionally framed as next-stage compiler/lowering work rather than hidden behind benchmark-only flags in this PR.

Notes on TMA and warp specialization

TileLang has general TMA and warp-specialization infrastructure, and I tested private SM120 NVFP4 variants using those mechanisms. I am not promoting those paths in this PR yet because the clean implementation boundary is not just "turn on TMA" or "turn on warp specialization".

For this PR, the stable contribution is the tile-op API, the SM120 block-scaled MMA helper, the NVFP4 GEMM example, correctness coverage, and the packed-FP4 shared-memory fixes needed for the exact 128x128x256 tile to compile and launch correctly.

Future work

The next optimization stage should keep the public API small and move performance-specific logic into the SM120 block-scaled GEMM lowering path:

- add a dedicated SM120 block-scaled operand/scale fragment package in the GEMM emitter,
- improve A/B/SFA/SFB fragment staging and reuse around the OMMA.SF issue sequence,
- add a clean TMA-capable load path for operands and scale factors where it fits TileLang's pipeline model,
- evaluate automatic or handwritten warp specialization as part of that clean lowering path,
- keep the C++ template layer focused on PTX/asm primitives, not a handwritten full GEMM mainloop.

Add warp-level mxf4nvf4 block-scale MMA lowering and coverage so TileLang can validate NVF4 kernels against SM120/CUTLASS behavior.

coderabbitai · 2026-06-09T10:28:13Z

📝 Walkthrough

Adds SM120 NVF4 block-scale MMA support end-to-end: TL builtin intrinsic, PTX/mma template, CodeGen emission, 4-bit layout transforms and emitter wiring, CUTLASS CUDA reference kernel, TileLang correctness harness with packing/decoding utilities, and CUDA-gated tests.

Changes

NVF4 Block-Scale MMA Support

| Layer | File(s) | Summary |
| --- | --- | --- |
| Compiler op and TIR intrinsic surface | `src/op/builtin.{cc,h}`, `tilelang/language/ast/ir.py`, `tilelang/language/tir/ir.py`, `tilelang/language/tir/op.py` | Register `tl.ptx_mma_block_scale` as a TL builtin with 17 inputs and expose dtype-forwarding TIR/AST wrappers plus an op wrapper that converts metadata into `StringImm`/`IntImm` before invoking the intrinsic. |
| PTX instruction template and config | `src/tl_templates/cuda/gemm_sm120.h` | Define SM120 enums and `SM120MmaBlockScaledConfig` with a specialization enabling exactly `kMxf4nvf4`/scale-vec-4/`kUE4M3`, add inline-PTX device implementation for `mma.sync.aligned.m16n8k64`, and provide a guarded `sm120_mma_sync_blockscaled` template wrapper. |
| CUDA codegen emission | `src/cuda/codegen/codegen_cuda.cc` | Extend CodeGenTileLangCUDA to recognize `tl.ptx_mma_block_scale` (17 args), validate the supported configuration, rewrite fp4-packed A/B buffer operands (halving offsets where packed), and emit `tl::sm120_mma_sync_blockscaled` with template argument substitution. |
| Layout transforms and MMA emitter | `tilelang/cuda/intrinsics/layout/mma_layout.py`, `tilelang/cuda/intrinsics/macro/mma_macro_generator.py`, `tilelang/cuda/intrinsics/{macro/,}__init__.py`, `tilelang/intrinsics/__init__.py` | Add shared↔MMA layout mappings for 4-bit A/B operands and inverse load helpers; introduce `BlockScaleMmaConfig` registry; implement `TensorCoreIntrinEmitterWithBlockScale` subclass that overrides `ldmatrix_a`/`ldmatrix_b` and emits per-warp `ptx_mma_block_scale` calls using explicit scale buffers and computed packed K indices; re-export emitter through package initializers. |
| CUTLASS reference kernel | `maint/gemm/cutlass_nvf4_ref.cu` | Add a PyTorch-bound CUTLASS SM120 block-scaled GEMM entry `cutlass_nvf4_gemm_128x128x256` that validates tensor devices/contiguity and exact packed sizes, constructs CUTLASS problem/stride/layout views including SFA/SFB, checks `can_implement`, allocates workspace, runs the kernel, and synchronizes. |
| TileLang evaluation harness | `maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py` | Add a correctness-evaluation script: TileLang NVF4 primfunc generator with optional swizzled shared layouts, FP4/UE4M3 packing/repacking utilities for TileLang vs CUTLASS, FP4 decode to float32 reference, CUTLASS extension builder/loader, execution of both implementations, metric printing, and optional strict equality assertion. |
| Tests and correctness validation | `testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py` | Add CUDA-gated tests: FP4 decode table, swizzle layout helper, NVF4 primfunc generator, input/scale generators, fragment-layout coverage tests, lane-to-atom mapping tests, unsupported-config/type rejection tests, codegen source assertions, and end-to-end correctness tests comparing kernel outputs to a decoded float32 reference with zero tolerance. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

lucifer1004
LeiWang1999

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.06% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[CUDA] Add SM120 NVF4 block-scale MMA support' accurately and concisely describes the primary change: adding warp-level block-scale MMA support for NVF4 on SM120 GPUs. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Apply clang-format updates and remove an unused block-scale emitter local.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py`:
- Around line 253-256: The code sets cutlass_root using a hardcoded developer
path which is unsafe; update the logic in the block that defines repo and
cutlass_root (variables cutlass_root, repo and the use of
os.environ["CUTLASS_ROOT"]) to remove the hardcoded "/data/home/..." fallback
and instead: prefer the CUTLASS_ROOT environment variable if set, otherwise fall
back to repo / "3rdparty" / "cutlass"; ensure cutlass_root.exists() is checked
and raise a clear error or log if neither location exists so the failure is
explicit.

In `@tilelang/language/ast/ir.py`:
- Line 2140: The module's __all__ includes "ptx_mma_block_scale" but no symbol
by that name is defined or imported, which breaks wildcard imports; either
remove "ptx_mma_block_scale" from the __all__ list or add a proper
definition/import for ptx_mma_block_scale (e.g., define the function/class or
import it from its source) so the name is actually bound in this module; update
the __all__ entry near the existing list and ensure the symbol name matches
exactly the defined/imported identifier.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a5bd78f-7a51-446a-bd8a-0078295fbd05

📥 Commits

Reviewing files that changed from the base of the PR and between a3f7093 and a24f9f1.

📒 Files selected for processing (16)

maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
maint/gemm/cutlass_nvf4_ref.cu
src/cuda/codegen/codegen_cuda.cc
src/cuda/codegen/codegen_cuda.h
src/op/builtin.cc
src/op/builtin.h
src/tl_templates/cuda/instruction/mma_block_scale.h
testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
tilelang/cuda/intrinsics/__init__.py
tilelang/cuda/intrinsics/layout/mma_layout.py
tilelang/cuda/intrinsics/macro/__init__.py
tilelang/cuda/intrinsics/macro/mma_macro_generator.py
tilelang/intrinsics/__init__.py
tilelang/language/ast/ir.py
tilelang/language/tir/ir.py
tilelang/language/tir/op.py

Apply ruff formatting and remove trailing whitespace from the NVF4 block-scale emitter changes.

Remove a developer-specific CUTLASS fallback path, bind the block-scale MMA AST helper, and apply clang-format to the CUDA codegen changes.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

tilelang/cuda/intrinsics/macro/mma_macro_generator.py (2)

1402-1407: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve outer-dimension bases for plain scale buffers.

_scale_region_parts() treats Buffer differently from the A/B paths and drops every prefix dimension by returning ([], 0, 0). The accesses at Lines 1478-1482 and Line 1510 then only index the trailing two axes, which breaks direct N-D scale buffers and sliced views. Reuse _legalize_to_buffer_region() here so scale buffers follow the same region contract as A/B.

Suggested fix

     @staticmethod
     def _scale_region_parts(scale_buf: Buffer | BufferRegion):
-        if isinstance(scale_buf, BufferRegion):
-            return scale_buf.buffer, [r.min for r in scale_buf.region[:-2]], scale_buf.region[-2].min, scale_buf.region[-1].min
-        if isinstance(scale_buf, Buffer):
-            return scale_buf, [], 0, 0
-        raise ValueError(f"Unsupported scale buffer type: {type(scale_buf)}")
+        region = TensorCoreIntrinEmitter._legalize_to_buffer_region(scale_buf)
+        return (
+            region.buffer,
+            [r.min for r in region.region[:-2]],
+            region.region[-2].min,
+            region.region[-1].min,
+        )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py` around lines 1402 -
1407, _scale_region_parts currently returns (buffer, [], 0, 0) for plain Buffer
which drops outer-dimension bases and breaks N-D scale buffers; change it to
call and reuse _legalize_to_buffer_region(scale_buf) so both Buffer and
BufferRegion follow the same region contract as A/B. Update _scale_region_parts
to accept either type, call _legalize_to_buffer_region when scale_buf is a
Buffer to obtain the buffer and full region, and then extract the same tuples
(buffer, [r.min for r in region[:-2]], region[-2].min, region[-1].min) as for
BufferRegion; keep the existing ValueError for unsupported types. Ensure
references to _scale_region_parts usage (the indexing code that expects
preserved outer bases) continue to work unchanged.

1282-1333: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the fixed NVF4 contract in the constructor.

This emitter always lowers as mxf4nvf4 with k64, e2m1/e2m1, and the block-scale fragment layouts from this class, but it never rejects incompatible a_dtype, b_dtype, or accum_dtype inputs. A caller can currently instantiate this public API with, for example, float16 operands and still get NVF4 PTX emitted against mismatched fragment assumptions. Fail fast here instead of silently generating wrong code.

Suggested guard

     def __init__(
         self,
         a_dtype: str = T.float4_e2m1fn,
         b_dtype: str = T.float4_e2m1fn,
         accum_dtype: str = T.float32,
@@
         kind: str = "mxf4nvf4",
         scale_vec_size: int = 4,
         stype: str = "ue4m3",
     ):
+        if str(DataType(a_dtype)) != str(T.float4_e2m1fn) or str(DataType(b_dtype)) != str(T.float4_e2m1fn):
+            raise ValueError("SM120 block-scale MMA currently only supports float4_e2m1fn operands")
+        if str(DataType(accum_dtype)) != str(T.float32):
+            raise ValueError("SM120 block-scale MMA currently only supports float32 accumulation")
         self.block_scale_config = _get_block_scale_mma_config(kind, scale_vec_size, stype)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py` around lines 1282 -
1333, The constructor currently forces block-scale config to NVF4 but does not
validate the caller-provided dtypes, so callers can pass incompatible
a_dtype/b_dtype/accum_dtype and silently generate wrong code; after calling
_get_block_scale_mma_config(...) in __init__, add a guard that checks the
resolved self.block_scale_config.kind (and/or its expected dtype descriptors
from the config) against the incoming a_dtype, b_dtype, and accum_dtype
parameters (e.g., ensure a_dtype and b_dtype match the NVF4 fragment dtypes such
as T.float4_e2m1fn and accum_dtype matches the expected accumulator like
T.float32 or whatever the config exposes), and raise a ValueError with a clear
message if they mismatch; perform this validation before calling
super().__init__ so invalid combinations fail fast.

maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py (1)

342-354: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The FP32 reference diagnostics are missing the UE4M3 block scales.

ref is built from the decoded FP4 payloads only, so diff_tl_ref and diff_cutlass_ref are not checking the same computation as the block-scaled GEMMs. In the default NVF4_SCALE_MODE="varying" case, those numbers are inherently misleading. Either apply sfa_logical / sfb_logical per 16-K chunk when building ref, or drop these prints until that scale-aware reference exists.

Suggested minimal change

-    ref = _decode_rowmajor_fp4(a, M, K) @ _decode_rowmajor_fp4(b, N, K).T
-    diff_tl_ref = (c_tl - ref).abs()
-    diff_cutlass_ref = (c_cutlass - ref).abs()
     print("scale_mode:", scale_mode)
     print("input_mode:", input_mode)
     print("max_abs_diff:", diff.max().item())
     print("mean_abs_diff:", diff.mean().item())
     print("max_abs_diff_transposed:", diff_t.max().item())
     print("mean_abs_diff_transposed:", diff_t.mean().item())
-    print("max_abs_diff_tilelang_ref:", diff_tl_ref.max().item())
-    print("mean_abs_diff_tilelang_ref:", diff_tl_ref.mean().item())
-    print("max_abs_diff_cutlass_ref:", diff_cutlass_ref.max().item())
-    print("mean_abs_diff_cutlass_ref:", diff_cutlass_ref.mean().item())
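
The alternative mentioned in this comment, a scale-aware reference, could look roughly like the pure-Python sketch below. It assumes one scale per 16 elements along K for each row of A and each row of B (with B stored N x K), and that the scales are already decoded to floats. The function name and argument shapes are hypothetical and do not mirror the harness code.

```python
SCALE_BLOCK_K = 16  # one scale factor covers 16 consecutive K elements


def scaled_reference(a, b, sfa, sfb):
    """Block-scaled GEMM reference: C[m][n] = sum_k (a*sfa)[m][k] * (b*sfb)[n][k].

    a:   M x K decoded FP4 values (list of rows)
    b:   N x K decoded FP4 values (list of rows)
    sfa: M x (K // 16) decoded scale factors for A
    sfb: N x (K // 16) decoded scale factors for B
    """
    m_dim, n_dim, k_dim = len(a), len(b), len(a[0])
    if k_dim % SCALE_BLOCK_K != 0:
        raise ValueError("K must be a multiple of the scale block size")
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for kb in range(k_dim // SCALE_BLOCK_K):
                k0 = kb * SCALE_BLOCK_K
                # Unscaled partial dot product over one 16-element K block.
                partial = sum(a[m][k] * b[n][k] for k in range(k0, k0 + SCALE_BLOCK_K))
                # Both scales are constant within the block, so apply them once.
                acc += sfa[m][kb] * sfb[n][kb] * partial
            out[m][n] = acc
    return out


if __name__ == "__main__":
    a = [[1.0] * 32]
    b = [[2.0] * 32]
    print(scaled_reference(a, b, sfa=[[1.0, 2.0]], sfb=[[0.5, 4.0]]))
```

Applying the scales per block rather than per element keeps the reference close to how the block-scaled MMA is described, and with power-of-two scales the arithmetic stays exact in floating point.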

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py` around lines 342 - 354,
The FP32 reference `ref` is computed from raw decoded FP4 payloads so it doesn't
include the UE4M3 block scales (`sfa_logical`/`sfb_logical`), making
`diff_tl_ref` and `diff_cutlass_ref` invalid under NVF4_SCALE_MODE="varying";
fix by applying the per-block scales to the decoded tensors before matmul: after
calling `_decode_rowmajor_fp4(a, M, K)` and `_decode_rowmajor_fp4(b, N, K)`,
multiply each 16-element K block of every decoded A row by the corresponding
entry in `sfa_logical` and each 16-element K block of every decoded B row by the
corresponding entry in `sfb_logical` (or apply equivalent broadcasting) so
`ref = scaled_decoded_a @ scaled_decoded_b.T` matches the block-scaled GEMM,
otherwise remove the `diff_tl_ref`/`diff_cutlass_ref` prints until the
scale-aware reference is implemented; reference symbols: `ref`,
`_decode_rowmajor_fp4`, `sfa_logical`, `sfb_logical`, `diff_tl_ref`,
`diff_cutlass_ref`.

♻️ Duplicate comments (1)

maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py (1)

247-256: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast when the CUTLASS root is missing.

This now avoids the developer-local fallback, but it still passes non-existent include roots into load(). If CUTLASS_ROOT is unset or wrong and 3rdparty/cutlass is absent, the harness fails later with a compiler error instead of an explicit setup error.

Suggested fix

     repo = Path(__file__).resolve().parents[2]
     cutlass_root_env = os.environ.get("CUTLASS_ROOT")
     cutlass_root = Path(cutlass_root_env) if cutlass_root_env else repo / "3rdparty" / "cutlass"
+    if not cutlass_root.exists():
+        raise RuntimeError(
+            f"CUTLASS not found at {cutlass_root}. "
+            "Set CUTLASS_ROOT or ensure 3rdparty/cutlass exists."
+        )
 
     return load(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py` around lines 247 - 256,
Check that the computed cutlass_root (from cutlass_root_env or repo / "3rdparty"
/ "cutlass") actually exists before calling load; if it does not exist, raise a
clear RuntimeError instructing the developer to set CUTLASS_ROOT or populate
3rdparty/cutlass, and only pass existing include paths (cutlass_root / "include"
and cutlass_root / "tools" / "util" / "include") into the load(...) call instead
of blindly passing non-existent paths; refer to the variables cutlass_root_env,
cutlass_root, extra_include_paths and the load(...) call to locate where to add
the existence check and error raise.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5bd71dcd-3f86-40b2-b9d8-6c494431985b

📥 Commits

Reviewing files that changed from the base of the PR and between a24f9f1 and c3c51c9.

📒 Files selected for processing (6)

maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
maint/gemm/cutlass_nvf4_ref.cu
src/cuda/codegen/codegen_cuda.cc
src/tl_templates/cuda/instruction/mma_block_scale.h
tilelang/cuda/intrinsics/macro/mma_macro_generator.py
tilelang/language/ast/ir.py

🚧 Files skipped from review as they are similar to previous changes (3)

src/cuda/codegen/codegen_cuda.cc
src/tl_templates/cuda/instruction/mma_block_scale.h
maint/gemm/cutlass_nvf4_ref.cu

Make the CUTLASS comparison reference scale-aware, validate block-scale emitter dtypes, preserve scale buffer regions, and check CUTLASS include paths explicitly.

Rachmanino · 2026-06-11T06:12:11Z

Really appreciate your contribution! I think it would be better to provide a high-level API for block-scaled MMA on SM120, like the existing T.tcgen05_gemm_blockscaled for SM100 tcgen05. Besides, it may be better to move the block-scaled MMA instructions to src/tl_templates/cuda/gemm_sm120.h. Would love to discuss if you have better suggestions!

qqq-tao · 2026-06-12T08:08:52Z

Thanks, that makes sense.

This PR is currently a low-level SM120 NVF4 MMA emitter. It checks that the new mma.sync instruction works, and that the fragment layout and scale-factor lane mapping are correct.

I agree that users should have a higher-level API, similar to the existing SM100 blockscaled GEMM API. The SM120 version should not copy the TCGEN05/TMEM behavior directly, but it can provide a similar user experience on top of the SM120 warp-level mma.sync path.

For this PR, I can move the SM120 blockscaled MMA helper into gemm_sm120.h, so it sits with the SM120 GEMM code. Then we can add a higher-level tile op API as a follow-up, unless you prefer to include that in this PR.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tilelang/language/tir/op.py (1)
1079-1119: ⚠️ Potential issue | 🟠 Major

Fix tl.ptx_mma_block_scale intrinsic arity: set_num_inputs(21) doesn’t match CUDA lowering expectations.

src/op/builtin.cc:244-247 registers tl.ptx_mma_block_scale with .set_num_inputs(21).

src/cuda/codegen/codegen_cuda.cc:2852 lowers the intrinsic with ICHECK_EQ(op->args.size(), 17U) and consumes op->args[0..16] (accum_dtype/shape/layouts/k/vec_size/dtypes + A/B/C pointers & offsets + scale_a/scale_b).

The Python wrapper language/tir/op.py:1079-1119 passes exactly those 17 call_intrin args, so the registration count should be aligned (likely change .set_num_inputs(21) → 17).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/language/tir/op.py` around lines 1079 - 1119, The registered
intrinsic tl.ptx_mma_block_scale has a mismatched arity: the Python wrapper
function ptx_mma_block_scale builds 17 call_intrin arguments and the CUDA
lowering asserts ICHECK_EQ(..., 17), so update the intrinsic registration to use
.set_num_inputs(17) (replace the current .set_num_inputs(21)) so the
registration count matches the call_intrin arguments and the CUDA lowering
expectations.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82a9e9cc-83ba-44c4-acfa-79f6cc76f5fa

📥 Commits

Reviewing files that changed from the base of the PR and between cbc83fc and 17d2313.

📒 Files selected for processing (5)

src/cuda/codegen/codegen_cuda.cc
src/tl_templates/cuda/gemm_sm120.h
testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
tilelang/cuda/intrinsics/macro/mma_macro_generator.py
tilelang/language/tir/op.py

💤 Files with no reviewable changes (1)

tilelang/cuda/intrinsics/macro/mma_macro_generator.py

🚧 Files skipped from review as they are similar to previous changes (1)

testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py

Rachmanino · 2026-06-23T08:49:16Z

Nice work! May I ask whether we support TMA load for operands and SFs? If so, we may also illustrate that in our example. Also, I'm curious whether you've considered warp-specialization (either automatic version or handwritten) and its influence on the performance.

qqq-tao · 2026-06-24T06:54:23Z

Yes, I am still tuning to optimize the performance.

Scalar float4_e2m1fn shared buffers are physically packed at two logical values per byte. The shared-memory planner already stores offsets in bytes, but the direct merged-buffer rewrite path must convert those byte offsets back to logical FP4 buffer indices before codegen applies nibble-level addressing.

Add an explicit FP4-aware byte-offset conversion helper and use the packed-size helper for debug allocation logging. Update the SM120 NVFP4 block-scale codegen test to check formula-derived offsets for every K case instead of a K=256-only string check. Add a non-alias merge regression that exercises the direct rewrite path and verifies that a 16-byte aligned FP4 allocation starts at logical element offset 32.

Validation:
- ninja -C build
- pytest testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py testing/python/transform/test_tilelang_transform_merge_shared_memory_allocations.py -q
- pre-commit run --files src/transform/merge_shared_memory_allocations.cc testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py testing/python/transform/test_tilelang_transform_merge_shared_memory_allocations.py
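
The unit conversion described in this commit message can be illustrated in a few lines of Python. This is only a sketch of the arithmetic, assuming two FP4 values per byte as stated above; the helper names are hypothetical and are not the C++ helpers added in merge_shared_memory_allocations.cc.

```python
FP4_VALUES_PER_BYTE = 2  # packed scalar float4_e2m1fn: two logical values per byte


def fp4_alloc_bytes(num_elements: int) -> int:
    """Bytes needed to hold num_elements packed FP4 values (rounded up)."""
    return (num_elements + FP4_VALUES_PER_BYTE - 1) // FP4_VALUES_PER_BYTE


def align_up(offset: int, alignment: int) -> int:
    """Round a byte offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def byte_offset_to_fp4_index(byte_offset: int) -> int:
    """Convert a planner byte offset into a logical FP4 element index."""
    return byte_offset * FP4_VALUES_PER_BYTE


if __name__ == "__main__":
    # A buffer placed after 10 bytes of other data, aligned to 16 bytes,
    # starts at byte 16, which is logical FP4 element 32.
    start_byte = align_up(10, 16)
    print(start_byte, byte_offset_to_fp4_index(start_byte))
    # A 128 x 256 FP4 tile occupies half as many bytes as it has elements.
    print(fp4_alloc_bytes(128 * 256))
```

Keeping the planner in bytes and converting only at the rewrite boundary means a single place needs to know about the packing factor.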

Extend the SM120 NVFP4 blockscale MMA language tests with a deterministic varying-scale case. The new case packs per-16K UE4M3 scale bytes into the uint32 scale-word format consumed by the tile op and compares against a scale-aware FP32 reference.

Use power-of-two scale bytes so the reference remains exact enough for the existing zero-tolerance correctness check, while still exercising non-constant SFA/SFB packing and lane use.

Validation:
- focused NVFP4 pytest now reports 17 passed
- pre-commit run --all-files passed
- 512^3 nvfp4_gemm_compare correctness smoke passed
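As a rough illustration of the packing this commit message describes, the sketch below encodes power-of-two scales as UE4M3 bytes and packs four of them into one uint32 word. It assumes UE4M3 uses the E4M3 bit layout (4 exponent bits, 3 mantissa bits, bias 7) with the top bit unused, and it assumes scale `i` occupies bits `8*i` to `8*i+7` of the word. The byte order inside TileLang's scale word is not stated in the PR text, so treat that ordering, along with the names `ue4m3_from_pow2` and `pack_scale_word`, as hypothetical.

```python
import math

UE4M3_BIAS = 7          # E4M3 exponent bias (assumed to apply to UE4M3)
UE4M3_MANTISSA_BITS = 3


def ue4m3_from_pow2(value: float) -> int:
    """Encode an exact power of two as a UE4M3 byte with zero mantissa bits."""
    if value <= 0:
        raise ValueError("scale must be positive")
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent
    if mantissa != 0.5:
        raise ValueError("scale must be an exact power of two")
    exponent_field = (exponent - 1) + UE4M3_BIAS
    if not 1 <= exponent_field <= 15:
        raise ValueError("scale is outside the normal UE4M3 range")
    return exponent_field << UE4M3_MANTISSA_BITS


def pack_scale_word(scale_bytes: list[int]) -> int:
    """Pack four scale bytes into one uint32, byte i at bits 8*i (assumed order)."""
    if len(scale_bytes) != 4:
        raise ValueError("expected exactly four scale bytes")
    word = 0
    for i, byte in enumerate(scale_bytes):
        if not 0 <= byte <= 0xFF:
            raise ValueError("scale byte out of range")
        word |= byte << (8 * i)
    return word


# Four per-16-element scales covering one K=64 slice of a row.
scales = [1.0, 2.0, 0.5, 4.0]
word = pack_scale_word([ue4m3_from_pow2(s) for s in scales])
print(hex(word))
```

Because every scale here has a zero mantissa, multiplying by it only shifts the exponent of the FP32 reference, which is why a power-of-two choice can keep a zero-tolerance comparison meaningful.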

[CUDA] Add SM120 NVF4 block-scale MMA support

a24f9f1

Add warp-level mxf4nvf4 block-scale MMA lowering and coverage so TileLang can validate NVF4 kernels against SM120/CUTLASS behavior.

[Lint] Fix NVF4 pre-commit issues

07a9b6d

Apply clang-format updates and remove an unused block-scale emitter local.

coderabbitai Bot reviewed Jun 9, 2026

Comment thread maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py Outdated

Comment thread tilelang/language/ast/ir.py

qutao added 2 commits June 9, 2026 18:44

[Lint] Apply pre-commit formatting fixes

cb925fa

Apply ruff formatting and remove trailing whitespace from the NVF4 block-scale emitter changes.

[Lint] Address NVF4 PR review findings

c3c51c9

Remove a developer-specific CUTLASS fallback path, bind the block-scale MMA AST helper, and apply clang-format to the CUDA codegen changes.

coderabbitai Bot reviewed Jun 9, 2026

[CUDA] Address NVF4 block-scale review findings

cbc83fc

Make the CUTLASS comparison reference scale-aware, validate block-scale emitter dtypes, preserve scale buffer regions, and check CUTLASS include paths explicitly.

Refine SM120 NVF4 blockscaled MMA API

17d2313

coderabbitai Bot reviewed Jun 12, 2026

qutao added 3 commits June 12, 2026 18:04

Fix SM120 blockscale intrinsic arity

d724f3c

Add SM120 NVF4 blockscaled MMA tile op

6a4c776

Add SM120 NVFP4 GEMM benchmark

8d02f23

qutao added 4 commits June 29, 2026 18:33

Fix NVFP4 shared allocation footprint

c8cdf24

Apply clang-format to SM120 MMA helper

ec8d48c
