
Integrate MoE as first-class component #451

Open
rrutmann wants to merge 7 commits into moe from moe_refactor



@rrutmann rrutmann commented Jun 11, 2026

Collaborator

What does this PR do?

This PR migrates Mixture-of-Experts (MoE) support from the side package under moe into the main modalities codebase, where it is integrated as first-class registry components with FSDP2 and Expert Parallel (EP) support. It also adds a new set of MoE training configs and a focused MoE unit test suite.

General Changes

  • Core MoE integration
  • Used a shared core for rotary embeddings across the Qwen and GPT2 models
  • Added configs for FSDP and EP MoE training
  • Runtime fixes:
    • Mixed precision dtype mismatch fix in expert matmul path (BF16 activations vs FP32 expert params)
    • Checkpoint optimizer-state compatibility fix for custom EP optimizer state schemas
  • Added tests
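
For reviewers unfamiliar with the dtype mismatch listed under the runtime fixes, here is a minimal sketch of the failure mode and one possible remedy. It is not the code in this PR: `expert_matmul` is a hypothetical helper, and casting the expert weight to the activation dtype is an assumed strategy (the actual fix may cast in the other direction or rely on the mixed precision policy). In PyTorch, a matmul between a BF16 tensor and an FP32 tensor typically raises a `RuntimeError` rather than promoting types.

```python
import torch


def expert_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: multiply activations by one expert's weight matrix.

    Assumption: the weight is cast to the activation dtype when they differ,
    so BF16 activations can be multiplied with FP32 expert parameters.
    """
    if w.dtype != x.dtype:
        # Cast a temporary copy; the stored FP32 parameter is left untouched.
        w = w.to(x.dtype)
    return x @ w


# BF16 activations (e.g. produced under mixed precision) and FP32 expert params.
x = torch.randn(4, 8, dtype=torch.bfloat16)
w = torch.randn(8, 16, dtype=torch.float32)

out = expert_matmul(x, w)
```

Casting the weight rather than the activations keeps the output in the activation dtype, so downstream layers see the same precision they would without the expert path. Whether that trade-off matches this PR's fix should be checked against the diff.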

Breaking Changes

  • ..

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@rrutmann rrutmann requested a review from gbesposito June 11, 2026 09:58
@rrutmann rrutmann self-assigned this Jun 11, 2026