How to understand MoE architecture and expert routing
Prerequisites
Basic understanding of transformer models
What this does
Mixture-of-Experts (MoE) splits a model into multiple "expert" sub-networks with a router that selects which experts activate per token. This guide explains the architecture and shows how to inspect routing behavior at inference time.
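To make the router's role concrete, here is a minimal sketch of top-k routing for a single token. It assumes one common gating variant (softmax over the router logits, keep the k largest, renormalize their weights); real models differ in detail, and the names `softmax` and `route_token` are illustrative, not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Return (expert_ids, gate_weights) for one token (hypothetical helper)."""
    probs = softmax(router_logits)
    # Pick the k experts with the highest router probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]

# One token, four experts, top-2 routing.
experts, weights = route_token([1.0, 3.0, 2.0, 0.0], k=2)
print(experts, [round(w, 3) for w in weights])
```

The token's output would then be the weighted sum of the selected experts' outputs, using these gate weights. Only the selected experts run, which is why per-token compute tracks the active rather than the total parameter count.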
Steps
Understand expert count and activation. DeepSeek-V3 uses 256 routed experts per MoE layer (plus one shared expert that is always active) and activates 8 routed experts per token. Mixtral 8x7B has 8 experts per layer and activates 2 per token.
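Inspect model output. The Ollama generate endpoint returns the generated text plus timing and token-count metadata; it does not expose router logits or per-expert scores. Use it to confirm the model responds with greedy decoding before inspecting routing in a framework that exposes router outputs:
curl -s http://localhost:11434/api/generate \
  -d '{"model": "deepseek-r1:14b", "prompt": "What is MoE?", "stream": false, "options": {"temperature": 0}}' \
  | jq '.'
Note that the deepseek-r1:14b tag is a distilled dense model, not an MoE; substitute an MoE model you have pulled locally to exercise real expert routing.
Measure expert load balance. A healthy MoE model spreads tokens roughly evenly across experts. Heavily skewed routing usually indicates a training problem.
import torch
# Stand-in router logits, shape (total_tokens, num_experts); capture real ones from a framework that exposes them
router_logits = torch.randn(1024, 8)
counts = torch.bincount(router_logits.topk(2, dim=-1).indices.flatten(), minlength=8)
# Balanced: each expert receives ~total_tokens * top_k / num_experts assignments
print(counts.tolist())
Compare active vs. total parameters. DeepSeek-V3 has 671B total parameters but activates only ~37B per token. Check the parameter count that the local model reports:
ollama show deepseek-r1:14b | grep -i parameter
ollama show reports the total parameter count only; take the active count from the model card or paper.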
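As a sketch of how load balance might be quantified once per-token expert selections are available, the snippet below computes two simple indicators: the ratio of the busiest expert's load to the mean load, and the entropy of the load distribution normalized to [0, 1]. The function name `load_balance_report` and the toy assignment lists are hypothetical, and the input is assumed to be non-empty.

```python
import math
from collections import Counter

def load_balance_report(assignments, num_experts):
    """assignments: list of per-token lists of selected expert ids (assumed non-empty)."""
    counts = Counter(e for token in assignments for e in token)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    total = sum(loads)
    mean = total / num_experts
    fractions = [c / total for c in loads]
    # Entropy of the load distribution; log(num_experts) is the maximum.
    entropy = -sum(p * math.log(p) for p in fractions if p > 0)
    return {
        "loads": loads,
        "max_over_mean": max(loads) / mean,
        "normalized_entropy": entropy / math.log(num_experts),
    }

# Toy data: 4 tokens, top-2 routing, 4 experts.
balanced = [[0, 1], [2, 3], [0, 2], [1, 3]]
collapsed = [[0, 1], [0, 1], [0, 1], [0, 1]]
print(load_balance_report(balanced, 4))
print(load_balance_report(collapsed, 4))
```

In the balanced toy case every expert receives two assignments, so max_over_mean is 1.0. In the collapsed case two experts take all the traffic, so max_over_mean rises to 2.0 and the normalized entropy falls to about 0.5.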
Verification
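# Check the model metadata that Ollama reports
ollama show deepseek-r1:14b
# Expected: a summary listing architecture, parameter count, context length, and quantization.
# ollama show reports total parameters only, with no "active parameters" field.
# deepseek-r1:14b is a distilled dense model, so it should not report 671B; that figure belongs to DeepSeek-V3.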
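Because the runtime does not report active parameters, the active-vs-total comparison is usually done by hand from published figures. The sketch below uses the DeepSeek-V3 numbers quoted above; the helper names are hypothetical, and the "2 FLOPs per active parameter" figure is a common rule of thumb for a forward pass, not a measured value.

```python
def active_fraction(total_params, active_params):
    # Share of the model's weights that participate in one token's forward pass.
    return active_params / total_params

def approx_forward_flops_per_token(active_params):
    # Rule of thumb (assumption): ~2 FLOPs per active parameter per token
    # (one multiply and one add), ignoring attention over the context.
    return 2 * active_params

total, active = 671e9, 37e9  # DeepSeek-V3 figures quoted in this guide
print(f"active fraction: {active_fraction(total, active):.1%}")
print(f"approx forward GFLOPs/token: {approx_forward_flops_per_token(active) / 1e9:.0f}")
```

Memory still has to hold all 671B parameters, so the savings apply to compute per token, not to the weight footprint.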
Common failures
- Confusing total vs. active parameter counts: MoE papers usually list total parameters first. Total parameters determine the memory footprint; the active count determines per-token compute cost.
- Router collapse: If training goes wrong, most tokens route to the same few experts and the rest go unused. Balanced routing is a health indicator.
- Overlooking expert capacity: Each expert has a token budget; exceeding it causes token dropping in some implementations.
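The capacity and token-dropping behavior in the last item can be sketched as follows. This assumes the capacity-factor formulation used by some MoE implementations (capacity = ceil(capacity_factor × tokens × top_k / num_experts)) and a simple first-come, first-served dispatch; the function names are hypothetical, and implementations without a capacity limit do not drop tokens at all.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    # Per-expert budget of token assignments for one batch.
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def dispatch_with_capacity(assignments, num_experts, capacity):
    """Return (kept, dropped) lists of (token_index, expert_id) pairs."""
    used = [0] * num_experts
    kept, dropped = [], []
    for t, experts in enumerate(assignments):
        for e in experts:
            if used[e] < capacity:
                used[e] += 1
                kept.append((t, e))
            else:
                # Expert e is full: this assignment is dropped.
                dropped.append((t, e))
    return kept, dropped

# Four tokens, two experts, top-1 routing skewed toward expert 0.
assignments = [[0], [0], [0], [1]]
cap = expert_capacity(num_tokens=4, num_experts=2, top_k=1, capacity_factor=1.0)
kept, dropped = dispatch_with_capacity(assignments, 2, cap)
print(cap, kept, dropped)
```

Here each expert can take two assignments, so the third token routed to expert 0 is dropped. Raising capacity_factor trades extra compute and memory for fewer drops.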
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.