Summary
Request to investigate supporting DenseMixer-style training in Megatron Core's MoE layer. DenseMixer computes all expert outputs during the forward pass and uses a straight-through estimator (STE) to provide more precise gradients to the router, bypassing the non-differentiable Top-K selection.
The MoE architecture is unchanged — this is a training-mode flag that enables dense expert computation for better router gradient estimation during post-training (SFT/RL).
Motivation
Standard Top-K routing is non-differentiable — the router only receives gradient signal from selected experts, limiting its ability to learn optimal routing during fine-tuning. DenseMixer addresses this by computing all expert outputs in the forward pass for gradient estimation while preserving sparse Top-K selection at inference.
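The straight-through trick can be illustrated for a single token with a minimal NumPy sketch, writing the gradients out by hand instead of relying on autograd. All function and variable names here are illustrative assumptions, not Megatron Core or DenseMixer APIs: the forward value is the ordinary sparse Top-K mixture, while the backward pass pretends the Top-K mask is identity, so router logits receive gradient through all experts.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def densemixer_token(logits, expert_outputs, k, grad_output):
    """Sketch of DenseMixer-style routing for one token.

    logits: [E] router logits; expert_outputs: [E, d] (all experts computed);
    grad_output: [d], the gradient of the loss w.r.t. the token's output.
    Returns the sparse forward output plus the STE (dense) and standard
    (sparse) gradients on the router logits, for comparison.
    """
    probs = softmax(logits)                # router probabilities, [E]
    topk = np.argsort(probs)[-k:]          # indices of the Top-K experts
    mask = np.zeros_like(probs)
    mask[topk] = 1.0
    output = (probs * mask) @ expert_outputs   # forward: sparse Top-K mixture

    # Straight-through backward: treat mask as all-ones, so every expert's
    # output contributes to the gradient on the router probabilities.
    grad_probs_dense = expert_outputs @ grad_output   # [E]
    # Standard sparse backward, for comparison: only selected experts contribute.
    grad_probs_sparse = mask * grad_probs_dense       # [E]

    # Softmax Jacobian-vector product to map prob-gradients to logit-gradients.
    jvp = lambda g: probs * (g - probs @ g)
    return output, jvp(grad_probs_dense), jvp(grad_probs_sparse)
```

Under the sparse rule, unselected experts contribute nothing to the router's probability gradient; under the STE rule they do, which is exactly the extra signal DenseMixer provides during fine-tuning.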
Results across multiple MoE architectures during SFT:
- Qwen1.5-MoE-A2.7B: +2.2% average across 7 benchmarks
- OLMoE-1B-7B: +2.9% average
- Qwen3-30B-A3B: +3.7% on GPQA-Diamond
The overhead is modest: ~1.46× FLOPs (expert weights are already loaded, so memory overhead is negligible) with 9–29% wall-clock increase depending on dataset size. No inference cost. Compatible with LoRA/PEFT.
Requested Feature
Investigate adding a configuration flag in Megatron Core's MoE layer to enable dense expert computation with STE-based router gradient estimation during the forward/backward pass. Standard sparse Top-K routing is preserved at inference.
This flag would allow downstream RL frameworks that use Megatron Core as their training backend to opt into improved router training without modifying core MoE internals.
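As a rough sketch of how the opt-in might surface to downstream frameworks: the `moe_router_dense_backward` flag below is a hypothetical placeholder name, not an existing Megatron Core option, while the other fields follow current `TransformerConfig` naming.

```python
from megatron.core.transformer.transformer_config import TransformerConfig

# Hypothetical sketch — `moe_router_dense_backward` does not exist today;
# it illustrates the requested training-mode flag.
config = TransformerConfig(
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    num_moe_experts=64,
    moe_router_topk=8,
    moe_router_dense_backward=True,  # hypothetical: dense expert computation
                                     # + STE router gradients, training only
)
```

Inference paths would ignore the flag entirely, keeping standard sparse Top-K dispatch.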
References