Feature Request: DenseMixer - dense forward pass for MoE router gradient estimation during RL #4169

@sbhavani

Description

Summary

Request to investigate supporting DenseMixer-style training in Megatron Core's MoE layer. DenseMixer computes all expert outputs during the forward pass and uses a straight-through estimator (STE) to provide more precise gradients to the router, bypassing the non-differentiable Top-K selection.

The MoE architecture is unchanged — this is a training-mode flag that enables dense expert computation for better router gradient estimation during post-training (SFT/RL).

Motivation

Standard Top-K selection is non-differentiable: the router receives gradient signal only through the experts it selected, which limits its ability to learn better routing during fine-tuning. DenseMixer addresses this by computing all expert outputs in the forward pass during training and using them for router gradient estimation, while sparse Top-K selection is preserved at inference.
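To make the difference concrete, here is a minimal pure-Python sketch of the router gradient for a single token, with and without a dense straight-through estimate. It is an illustration under simplifying assumptions, not DenseMixer's or Megatron Core's implementation: expert outputs are scalars treated as constants, the loss is the layer output itself, Top-K weights are not renormalized, and the function names (`router_logit_grads`, `dense_ste`) are made up for this example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_indices(values, k):
    # Indices of the k largest entries.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def router_logit_grads(logits, expert_outputs, k, dense_ste):
    """Return (y, dL/dlogits) for one token, with loss L = y.

    Forward (both modes): y = sum over the Top-K experts of p_e * f_e.
    Backward:
      dense_ste=False -> dL/dp_e = f_e for selected experts, 0 otherwise.
      dense_ste=True  -> dL/dp_e = f_e for every expert (straight-through:
                         the Top-K mask is ignored in the backward pass).
    """
    probs = softmax(logits)
    selected = set(topk_indices(probs, k))

    # The forward value is sparse in both modes.
    y = sum(probs[e] * expert_outputs[e] for e in selected)

    if dense_ste:
        grad_probs = list(expert_outputs)
    else:
        grad_probs = [f if e in selected else 0.0
                      for e, f in enumerate(expert_outputs)]

    # Softmax backward: dL/dz_i = p_i * (g_i - sum_j p_j * g_j).
    dot = sum(p * g for p, g in zip(probs, grad_probs))
    grad_logits = [p * (g - dot) for p, g in zip(probs, grad_probs)]
    return y, grad_logits

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    outputs = [1.0, -2.0, 3.0, 0.5]
    print(router_logit_grads(logits, outputs, k=2, dense_ste=False))
    print(router_logit_grads(logits, outputs, k=2, dense_ste=True))
```

In the sparse case the gradient never sees the outputs of unselected experts, so changing them leaves the router update unchanged. With the dense estimate, every expert's output contributes to the gradient while the forward value stays the same.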

Results across multiple MoE architectures during SFT:

  • Qwen1.5-MoE-A2.7B: +2.2% average across 7 benchmarks
  • OLMoE-1B-7B: +2.9% average
  • Qwen3-30B-A3B: +3.7% on GPQA-Diamond

The overhead is modest: ~1.46× FLOPs (expert weights are already loaded, so memory overhead is negligible) with 9–29% wall-clock increase depending on dataset size. No inference cost. Compatible with LoRA/PEFT.

Requested Feature

Investigate adding a configuration flag in Megatron Core's MoE layer to enable dense expert computation with STE-based router gradient estimation during the forward/backward pass. Standard sparse Top-K routing is preserved at inference.
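As a rough sketch of the intended behaviour, the flag could gate which experts are evaluated per token: all of them during training when the flag is on, only the Top-K otherwise. Everything named here is hypothetical (`MoEConfigSketch`, `dense_router_grad`, `experts_to_compute`); none of it is an existing Megatron Core class, field, or function.

```python
from dataclasses import dataclass

@dataclass
class MoEConfigSketch:
    # Hypothetical stand-in for an MoE config; not a Megatron Core class.
    num_experts: int = 8
    top_k: int = 2
    # Hypothetical flag name: opt into dense expert computation in training.
    dense_router_grad: bool = False

def experts_to_compute(config, selected, training):
    """Return the expert indices to evaluate for one token.

    `selected` is the Top-K set chosen by the router. Dense computation is
    used only when training with the flag enabled; inference stays sparse.
    """
    if training and config.dense_router_grad:
        return list(range(config.num_experts))
    return sorted(selected)

if __name__ == "__main__":
    cfg = MoEConfigSketch(num_experts=4, top_k=2, dense_router_grad=True)
    print(experts_to_compute(cfg, {3, 1}, training=True))
    print(experts_to_compute(cfg, {3, 1}, training=False))
```

Keeping the decision behind a single config field would let a downstream framework turn it on without touching the routing code itself.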

This flag would allow downstream RL frameworks that use Megatron Core as their training backend to opt into improved router training without modifying core MoE internals.
