Enable NVFP4 fused grouped MLP SwiGLU #5

Closed
sraman-rgb wants to merge 1 commit into pggPL:grouped_gemm_nvfp4_and_hopper from sraman-rgb:nvfp4-grouped-mlp-fc1-swiglu

@sraman-rgb

Stacked on NVIDIA#2971.

This PR adds the incremental NVFP4 grouped MLP SwiGLU fusion changes on top of PR 2971, without modifying PR 2971 itself and without the temporary Python bulk_allocate fallback.

Changes:

  • Enable NVFP4 recipes in grouped MLP fusion selection.
  • Add NVFP4 forward/backward fused grouped MLP SwiGLU handling.
  • Use grouped tensor GEMM for NVFP4 FC1 dgrad with grouped_fc1_weight directly.
  • Fix NVFP4 discrete-input grouped GEMM layout metadata.
  • Avoid the single-group split-size host sync in grouped linear.
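
For readers unfamiliar with the operation being fused, the following is a minimal unfused reference sketch of a grouped MLP with SwiGLU in plain NumPy. It is not the code from this PR and does not use NVFP4 or any Transformer Engine API. The function names, the weight layout (`[out_features, in_features]`), and the convention that the first half of the FC1 output is the gate are all assumptions made for illustration.

```python
import numpy as np


def silu(x):
    # SiLU (swish) activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))


def swiglu(x):
    # Assumed convention: first half of the last dim is the gate,
    # second half is the linear ("up") branch.
    gate, up = np.split(x, 2, axis=-1)
    return silu(gate) * up


def grouped_mlp_swiglu(x, split_sizes, fc1_weights, fc2_weights):
    """Hypothetical reference: apply a separate FC1 -> SwiGLU -> FC2 to each group.

    x:            [total_tokens, hidden], rows ordered group by group
    split_sizes:  number of rows belonging to each group
    fc1_weights:  per-group arrays of shape [2 * ffn, hidden]
    fc2_weights:  per-group arrays of shape [hidden, ffn]
    """
    outputs = []
    start = 0
    for size, w1, w2 in zip(split_sizes, fc1_weights, fc2_weights):
        tokens = x[start:start + size]      # rows routed to this group
        hidden = swiglu(tokens @ w1.T)      # FC1 followed by SwiGLU
        outputs.append(hidden @ w2.T)       # FC2
        start += size
    return np.concatenate(outputs, axis=0)
```

A fused kernel computes the same result without materializing the intermediate FC1 output per group in a Python loop. Note that with a single group the only split size equals `x.shape[0]`, which is known on the host without reading a device tensor; this is one plausible reading of the host-sync item above, not a description of the actual change.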

Validation:

  • NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 python3 -m pytest -q --tb=short -ra tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_grouped_mlp -k 'scaled_swiglu and nvfp4' -> 48 passed, 336 skipped
  • NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 python3 -m pytest -q --tb=short -ra tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_grouped_mlp -> 1216 passed, 4544 skipped

@sraman-rgb closed this on May 23, 2026