RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Understanding AI Models

03. Dense vs Mixture of Experts

Chapter 3 of 20 · 15 min
KEY INSIGHT

MoE models trade memory capacity for parameter count: total params are large and must all fit in memory, but active params per token are manageable.

Modern large models use two primary architectural families: dense transformers and mixture-of-experts (MoE). Understanding the difference explains why models with high parameter counts can still run on modest hardware.

Dense transformers:

Every forward pass activates all parameters. A 70B dense model loads all 70B weights into VRAM and performs computation across the entire network for each token.

# Pseudo-code for dense forward pass
for layer in model.layers:
    hidden_states = layer(hidden_states)  # All 70B params involved
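
To put a number on "loads all 70B weights", here is a rough sketch of weight memory at a few precisions. The function name and the bytes-per-parameter table are illustrative assumptions, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate for a dense model (illustrative only).
# Assumption: memory ~= parameter count x bytes per parameter; KV cache,
# activations, and framework overhead are ignored.

BYTES_PER_PARAM = {
    "fp16": 2.0,  # 16-bit floats
    "int8": 1.0,  # 8-bit quantization
    "int4": 0.5,  # 4-bit quantization (ignoring scale/zero-point overhead)
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(70e9, precision)
        print(f"70B dense @ {precision}: ~{gb:.0f} GB of weights")
```

Because a dense model uses every weight for every token, this stored size is also roughly what is read per generated token.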

Mixture of Experts (MoE):

MoE models have many "expert" FFN networks but only activate a subset per token. A 46B-active/570B-total MoE model must keep all 570 billion parameters in memory, but computes with only ~46 billion of them for each token.

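# Pseudo-code for MoE forward pass (attention sublayers omitted)
for layer in model.layers:
    top_k_indices, gate_weights = layer.router(hidden_states)  # Select 2-8 experts
    expert_out = 0
    for expert_idx, weight in zip(top_k_indices, gate_weights):
        expert_out += weight * layer.experts[expert_idx](hidden_states)
    hidden_states = hidden_states + expert_out  # Residual connection
    # Remaining experts stay in memory but sit idle for this token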

The router decision:

The router is a small neural network (often a single linear layer) that selects which experts process each token. In Mixtral 8x7B, each token activates exactly 2 of 8 experts per layer. Over the model's 32 layers, a single token passes through 64 expert FFNs (2 x 32) out of 256, with the rest idle for that token.
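
The routing step can be sketched as a toy top-k layer in pure Python. This is a minimal illustration, not Mixtral's implementation: the "experts" are hypothetical scaling functions standing in for FFNs, the router is assumed to be one linear layer, and renormalizing the gate weights over the chosen experts is one common choice rather than the only one.

```python
import math
import random

# Toy top-k MoE layer (illustrative sketch, not any real model's code).
# Assumptions: each "expert" is a simple function standing in for an FFN,
# and the router is a single linear layer producing one logit per expert.

NUM_EXPERTS = 8
TOP_K = 2
HIDDEN = 4


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, top_k=TOP_K):
    """Return (expert_indices, gate_weights) for one token vector."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over chosen experts
    return chosen, gates


def moe_layer(token, router_weights, experts):
    """Weighted sum of the selected experts' outputs, plus a residual connection."""
    chosen, gates = route(token, router_weights)
    mixed = [0.0] * len(token)
    for idx, gate in zip(chosen, gates):
        out = experts[idx](token)  # only the chosen experts are ever called
        mixed = [m + gate * o for m, o in zip(mixed, out)]
    return [t + m for t, m in zip(token, mixed)], chosen


if __name__ == "__main__":
    rng = random.Random(0)
    router_weights = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]
    # Hypothetical experts: expert i scales its input by (i + 1).
    experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
    token = [0.5, -1.0, 0.25, 2.0]
    output, chosen = moe_layer(token, router_weights, experts)
    print("experts used:", chosen, "of", NUM_EXPERTS)
    print("output:", output)
```

Note that all eight experts exist in memory the whole time; routing only decides which two are called for a given token.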

Practical implications:

Property | Dense | MoE
Memory for weights at same total params | All weights loaded | About the same: all weights loaded, only active ones used
Inference speed | Consistent | Can vary by token routing
Per-token weight access | All params read and computed | Only active params read and computed
Training stability | More stable | Requires careful tuning

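Real example: DeepSeek-V2 has 236B total parameters but only 21B active per token. The full 236B must still be stored (roughly 472 GB at 16-bit precision, or about 118 GB at 4-bit, before KV cache and overhead), while only about 42 GB of 16-bit weights are used for any one token. This gives per-token compute close to that of a ~21B dense model with the capacity of a 236B-parameter model.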
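
The DeepSeek-V2 parameter counts can be turned into a quick estimate of how much must be stored versus how much is used per token. This is a rough sketch: the helper name is hypothetical, the bytes-per-parameter values are nominal, and KV cache and runtime overhead are ignored.

```python
# Compare stored vs. per-token-active weights for an MoE model (rough sketch).
# Assumptions: weights dominate memory; KV cache and overhead are ignored;
# bytes-per-parameter values are nominal.

def moe_footprint(total_params: float, active_params: float, bytes_per_param: float) -> dict:
    """Return stored size, per-token active size (GB), and the active ratio."""
    return {
        "stored_gb": total_params * bytes_per_param / 1e9,
        "active_gb": active_params * bytes_per_param / 1e9,
        "active_ratio": active_params / total_params,
    }


if __name__ == "__main__":
    # DeepSeek-V2 figures from the text: 236B total, 21B active per token.
    for label, bpp in [("16-bit", 2.0), ("4-bit", 0.5)]:
        stats = moe_footprint(236e9, 21e9, bpp)
        print(
            f"{label}: store ~{stats['stored_gb']:.0f} GB, "
            f"use ~{stats['active_gb']:.1f} GB per token "
            f"({stats['active_ratio']:.1%} of parameters)"
        )
```

The stored size determines whether the model fits on your hardware at all; the active size is what drives per-token compute and weight reads.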

EXERCISE

Calculate the active parameter ratio for Mixtral 8x7B (8 experts, top-2 routing). Then estimate how this affects VRAM if all weights were loaded but only active ones computed.
