03. Dense vs Mixture of Experts
Modern large models use two primary architectural families: dense transformers and mixture-of-experts (MoE). Understanding the difference explains why a model with a very large parameter count can cost far less compute per token than its size suggests, even though all of its weights must still fit in memory.
Dense transformers:
Every forward pass activates all parameters. A 70B dense model loads all 70B weights into VRAM and performs computation across the entire network for each token.
# Pseudo-code for dense forward pass
for layer in model.layers:
    hidden_states = layer(hidden_states)  # All 70B params involved
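
To put a number on the memory cost of a dense model, the sketch below estimates the weight footprint at a few common precisions. It counts weights only (no KV cache, activations, or framework overhead), and the helper name `weight_memory_gb` is made up for this illustration.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # A 70B dense model needs every weight resident, whatever the precision.
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(70e9, precision)
        print(f"70B dense @ {precision}: {size:.0f} GB")
```

Because a dense model uses every weight for every token, this footprint is also a rough proxy for how much data is read per generated token.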
Mixture of Experts (MoE):
MoE models have many "expert" feed-forward networks (FFNs) but activate only a subset per token. A 46B-active/570B-total MoE model must keep all 570 billion parameters in memory but computes with only ~46 billion of them for each token.
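# Pseudo-code for MoE forward pass (attention sublayers omitted)
for layer in model.layers:
    weights, top_k_indices = router(hidden_states)  # Select 2-8 experts
    expert_output = 0
    for weight, expert_idx in zip(weights, top_k_indices):
        # Each expert reads the same input; outputs are combined by router weight
        expert_output += weight * experts[expert_idx](hidden_states)
    hidden_states = hidden_states + expert_output  # Residual connection
    # Rest of 570B params stay loaded but sit idle for this token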
The router decision:
The router is a small learned network (typically a single linear layer) that selects which experts process each token. In Mixtral 8x7B, each token activates exactly 2 of 8 experts per layer. Over 32 layers, a single token passes through 64 expert FFNs (2 x 32), while the other 192 (6 x 32) sit idle for that token.
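
The selection step can be sketched in plain Python. This is a minimal illustration of one common formulation (pick the top-k router scores, then apply a softmax over just those scores), not the implementation of any particular model. The function name `route_top_k` and the example scores are hypothetical.

```python
import math


def route_top_k(logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    # Indices of the k largest router scores, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the max before exponentiating for numerical stability.
    max_logit = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
selected = route_top_k(logits, k=2)

if __name__ == "__main__":
    for expert_idx, weight in selected:
        print(f"expert {expert_idx}: weight {weight:.3f}")
```

In a real model the scores come from a learned projection of the token's hidden state, and the returned weights scale each selected expert's output before the outputs are summed.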
Practical implications:
| Property | Dense | MoE |
|---|---|---|
| VRAM at same total params | All weights loaded | Roughly the same (all experts loaded, only active ones used) |
| Inference speed | Consistent | Can vary by token routing |
| Memory bandwidth per token | All params read | Only active params read |
| Training stability | More stable | Requires careful tuning |
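Real example: DeepSeek-V2 has 236B total parameters but only 21B active per token. The 21B active parameters amount to about 42 GB at 16-bit precision, but all 236B parameters (about 472 GB at 16-bit precision, less when quantized) must still be held in memory. The result is per-token compute close to that of a 21B dense model, combined with the capacity of a 236B-parameter model.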
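
A rough calculation separates the two quantities that matter here: memory, which scales with the total parameter count, and per-token compute, which scales with the active count. The sketch below uses the 236B/21B figures quoted above and assumes 16-bit weights plus the common approximation of about 2 FLOPs per active parameter per token (attention over the context is ignored). The helper name `moe_footprint` is made up for this example.

```python
def moe_footprint(total_params: float, active_params: float,
                  bytes_per_param: float = 2.0) -> dict[str, float]:
    """Rough cost summary for an MoE model; weights only, 1 GB = 1e9 bytes."""
    return {
        # Share of the model that does work for any single token.
        "active_fraction": active_params / total_params,
        # All experts must be resident, so memory follows the total count.
        "resident_weights_gb": total_params * bytes_per_param / 1e9,
        # Weights actually read and multiplied for one token.
        "active_weights_gb": active_params * bytes_per_param / 1e9,
        # Assumption: ~2 FLOPs (one multiply, one add) per active parameter.
        "approx_flops_per_token": 2 * active_params,
    }


stats = moe_footprint(total_params=236e9, active_params=21e9)

if __name__ == "__main__":
    for name, value in stats.items():
        print(f"{name}: {value:.3g}")
```

Changing `bytes_per_param` to 0.5 gives a 4-bit estimate. The active fraction and FLOP count stay the same, since quantization shrinks storage but not the number of parameters involved.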
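Calculate the active parameter ratio for Mixtral 8x7B (8 experts, top-2 routing). Keep in mind that attention and embedding weights are shared across experts, so the ratio is not simply 2/8. Then estimate the VRAM required when all weights are loaded but only the active ones are used in computation.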