
How to understand MoE architecture and expert routing

Intermediate · 15 min · By Fredoline Eruo
PREREQUISITES

Basic understanding of transformer models

What this does

Mixture-of-Experts (MoE) replaces the dense feed-forward block in each transformer layer with multiple "expert" sub-networks and a router that selects which experts run for each token. This guide explains the architecture and shows how to reason about routing behavior and parameter counts at inference time.
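
To make the router's job concrete, here is a minimal sketch of top-k routing for a single token. It assumes one common gating scheme (softmax over the selected logits only, as Mixtral-style routers do); other models score and normalize differently. The function name route_token and the logit values are hypothetical, chosen only for illustration.

```python
import math

def route_token(router_logits, k):
    """Return [(expert_id, gate_weight), ...] for the k highest-scoring experts."""
    # Rank expert indices by logit, highest first, and keep the top k.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k gate weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2]
selected = route_token(logits, k=2)
for expert_id, weight in selected:
    print(f"expert {expert_id}: gate weight {weight:.2f}")
```

The layer's output for the token is then the gate-weighted sum of the selected experts' outputs; the unselected experts do no work for that token.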

Steps

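  1. Understand expert count and activation. Counts are per MoE layer, and the router chooses independently in each layer. DeepSeek-V3 has 256 routed experts plus one shared expert that is always on, activating 8 routed experts per token. Mixtral 8x7B has 8 experts, activating 2 per token.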

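  2. Inspect what the runtime exposes. The Ollama generate endpoint returns the generated text and timing metadata, not per-layer router logits, so it cannot show which experts fired. To see actual routing decisions, load the model in a framework that returns router outputs, for example Hugging Face Transformers with output_router_logits=True on Mixtral. Run an MoE model through Ollama to confirm what the response contains:

    curl -s http://localhost:11434/api/generate \
      -d '{"model": "mixtral:8x7b", "prompt": "What is MoE?", "stream": false, "options": {"temperature": 0}}' \
      | jq '.'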
    
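  3. Measure expert load balance. Over a large, varied batch, a healthy MoE layer spreads tokens roughly evenly across experts. Some specialization is normal, but a persistent heavy skew toward a few experts usually points to a training problem.

    import torch
    # Sketch: random values stand in for real router logits of shape (num_tokens, num_experts)
    router_logits = torch.randn(1024, 8)
    top_k = router_logits.topk(2, dim=-1).indices  # experts chosen per token
    counts = torch.bincount(top_k.flatten(), minlength=8)  # assignments per expert
    # Balanced: each expert receives ~num_tokens * k / num_experts assignments (here 256)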
    
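  4. Compare active vs. total parameters. DeepSeek-V3 has 671B total parameters but activates only ~37B per token, so per-token compute is closer to that of a 37B dense model while memory must still hold all 671B. Ollama reports the total only; active counts come from the model card. Check the total for a local MoE model (distilled tags such as deepseek-r1:14b are dense, so they have no active/total split):

    ollama show mixtral:8x7b | grep -i parameter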
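
The ratios behind steps 1 and 4 can be worked out directly. The sketch below uses only the figures quoted in this guide; the helper name active_fraction is hypothetical.

```python
def active_fraction(active, total):
    """Share of a quantity that is used for each token."""
    return active / total

# Figures quoted in this guide.
deepseek_v3_params = active_fraction(37e9, 671e9)  # active vs. total parameters
deepseek_v3_experts = active_fraction(8, 256)      # routed experts used per token
mixtral_experts = active_fraction(2, 8)            # experts used per token

print(f"DeepSeek-V3 parameters active per token: {deepseek_v3_params:.1%}")
print(f"DeepSeek-V3 routed experts active per token: {deepseek_v3_experts:.1%}")
print(f"Mixtral 8x7B experts active per token: {mixtral_experts:.1%}")
```

The parameter fraction comes out higher than the expert fraction because attention layers, embeddings, and any always-on components run for every token regardless of routing.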
    

Verification

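# Check model metadata for an MoE model
ollama show mixtral:8x7b
# Expected: a model summary that includes the architecture, total parameter count, and quantization.
# Ollama does not print an active-parameter figure; take that from the model card.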

Common failures

  • Confusing total vs. active parameter counts: Model names and papers usually lead with the total. Total parameters determine the memory footprint; active parameters determine per-token compute cost.
  • Router collapse: If load balancing fails during training, most tokens route to one or a few experts and the rest go unused. Roughly balanced routing is a health indicator.
  • Overlooking expert capacity: In some implementations each expert has a fixed token budget per batch; tokens that exceed it are dropped or passed through without expert processing.
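
The last two failure modes can be illustrated with a small simulation. This is a sketch under stated assumptions, not any framework's implementation: assignments are synthetic, capacity is taken as capacity_factor × tokens / num_experts, overflow is simply counted as dropped, and the helper names (expert_load, imbalance, dropped_tokens) are hypothetical.

```python
import random
from collections import Counter

def expert_load(assignments, num_experts):
    """Count how many token slots each expert receives."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance(loads):
    """Max load divided by mean load: 1.0 is perfectly even, num_experts is full collapse."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

def dropped_tokens(loads, capacity_factor):
    """Token slots beyond each expert's capacity, assuming overflow is dropped."""
    capacity = int(capacity_factor * sum(loads) / len(loads))
    return sum(max(0, load - capacity) for load in loads)

random.seed(0)
num_experts, num_tokens = 8, 4096

# Synthetic routing: uniform choice vs. every token sent to expert 0.
balanced = [random.randrange(num_experts) for _ in range(num_tokens)]
collapsed = [0] * num_tokens

for name, assignments in [("balanced", balanced), ("collapsed", collapsed)]:
    loads = expert_load(assignments, num_experts)
    print(name, loads,
          f"imbalance={imbalance(loads):.2f}",
          f"dropped={dropped_tokens(loads, capacity_factor=1.25)}")
```

With real router outputs, the same three functions can be applied per layer to spot layers whose imbalance ratio drifts well above 1.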

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
