Summary
Request to investigate supporting DenseMixer-style training in Megatron Core's MoE layer. DenseMixer computes all expert outputs during the forward pass and uses a straight-through estimator (STE) to provide more precise gradients to the router, bypassing the non-differentiable Top-K selection.
The MoE architecture is unchanged — this is a training-mode flag that enables dense expert computation for better router gradient estimation during post-training (SFT/RL).
Motivation
Standard Top-K routing is non-differentiable — the router only receives gradient signal from selected experts, limiting its ability to learn optimal routing during fine-tuning. DenseMixer addresses this by computing all expert outputs in the forward pass for gradient estimation while preserving sparse Top-K selection at inference.
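The straight-through trick can be illustrated for a single token with a minimal NumPy sketch, writing the gradients out by hand instead of relying on autograd. All function and variable names here are illustrative assumptions, not Megatron Core or DenseMixer APIs: the forward value is the ordinary sparse Top-K mixture, while the backward pass pretends the Top-K mask is identity, so router logits receive gradient through all experts.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def densemixer_token(logits, expert_outputs, k, grad_output):
    """Sketch of DenseMixer-style routing for one token.

    logits: [E] router logits; expert_outputs: [E, d] (all experts computed);
    grad_output: [d], the gradient of the loss w.r.t. the token's output.
    Returns the sparse forward output plus the STE (dense) and standard
    (sparse) gradients on the router logits, for comparison.
    """
    probs = softmax(logits)                # router probabilities, [E]
    topk = np.argsort(probs)[-k:]          # indices of the Top-K experts
    mask = np.zeros_like(probs)
    mask[topk] = 1.0
    output = (probs * mask) @ expert_outputs   # forward: sparse Top-K mixture

    # Straight-through backward: treat mask as all-ones, so every expert's
    # output contributes to the gradient on the router probabilities.
    grad_probs_dense = expert_outputs @ grad_output   # [E]
    # Standard sparse backward, for comparison: only selected experts contribute.
    grad_probs_sparse = mask * grad_probs_dense       # [E]

    # Softmax Jacobian-vector product to map prob-gradients to logit-gradients.
    jvp = lambda g: probs * (g - probs @ g)
    return output, jvp(grad_probs_dense), jvp(grad_probs_sparse)
```

Under the sparse rule, unselected experts contribute nothing to the router's probability gradient; under the STE rule they do, which is exactly the extra signal DenseMixer provides during fine-tuning.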
Results across multiple MoE architectures during SFT:
- Qwen1.5-MoE-A2.7B: +2.2% average across 7 benchmarks
- OLMoE-1B-7B: +2.9% average
- Qwen3-30B-A3B: +3.7% on GPQA-Diamond
The overhead is modest: ~1.46× FLOPs (expert weights are already loaded, so memory overhead is negligible) with 9–29% wall-clock increase depending on dataset size. No inference cost. Compatible with LoRA/PEFT.
Requested Feature
Investigate adding a configuration flag in Megatron Core's MoE layer to enable dense expert computation with STE-based router gradient estimation during the forward/backward pass. Standard sparse Top-K routing is preserved at inference.
This flag would allow downstream RL frameworks that use Megatron Core as their training backend to opt into improved router training without modifying core MoE internals.
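As a rough sketch of how the opt-in might surface to downstream frameworks: the `moe_router_dense_backward` flag below is a hypothetical placeholder name, not an existing Megatron Core option, while the other fields follow current `TransformerConfig` naming.

```python
from megatron.core.transformer.transformer_config import TransformerConfig

# Hypothetical sketch — `moe_router_dense_backward` does not exist today;
# it illustrates the requested training-mode flag.
config = TransformerConfig(
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    num_moe_experts=64,
    moe_router_topk=8,
    moe_router_dense_backward=True,  # hypothetical: dense expert computation
                                     # + STE router gradients, training only
)
```

Inference paths would ignore the flag entirely, keeping standard sparse Top-K dispatch.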
References