Dense vs Mixture of Experts — Understanding AI Models (Chapter 3)

Modern large models fall into two primary architectural families: dense transformers and mixture-of-experts (MoE). Understanding the difference explains why a model with a very high total parameter count can still cost far less compute per token than its size suggests.

Dense transformers:

Every forward pass activates all parameters. A 70B dense model loads all 70B weights into VRAM and performs computation across the entire network for each token.

# Pseudo-code for dense forward pass
for layer in model.layers:
    hidden_states = layer(hidden_states)  # All 70B params involved
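
The memory cost of a dense model follows directly from its parameter count. A rough sketch, assuming the weights dominate memory use and ignoring activations and the KV cache (the function name and the dtype table are illustrative, not from any library):

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype="fp16"):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"70B dense, {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

At 16-bit precision this gives about 140 GB for a 70B model, which is why such models are usually quantized or split across several GPUs.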

Mixture of Experts (MoE):

MoE models contain many "expert" feed-forward networks (FFNs) but activate only a subset for each token. A 46B-active/570B-total MoE model must still hold all 570 billion parameters in memory, but it computes with only ~46 billion of them for each token.

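# Pseudo-code for MoE forward pass (attention sublayers omitted)
for layer in model.layers:
    gate_weights, top_k_indices = layer.router(hidden_states)  # Select top-k experts (typically 2-8)
    expert_output = 0
    for weight, expert_idx in zip(gate_weights, top_k_indices):
        # Every selected expert sees the same input; outputs are mixed by gate weight
        expert_output += weight * layer.experts[expert_idx](hidden_states)
    hidden_states = hidden_states + expert_output  # Residual connection
    # Unselected experts stay in memory but do no compute for this token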

The router decision:

The router is a small neural network (often a single linear layer) that selects which experts process each token. In Mixtral 8x7B, each token activates exactly 2 of 8 experts per layer. Over 32 layers, a single token passes through 64 expert FFNs (2 x 32), while the other 6 experts in each layer stay idle for that token.
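
The routing step can be sketched in plain Python for a single token. This is a toy illustration, not Mixtral's actual implementation: the experts are hypothetical scaling functions, the router is assumed to be one linear score per expert, and the gate weights come from a softmax over the selected scores, which is one common choice.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(logits, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([logits[i] for i in chosen])
    return chosen, gates

def moe_layer(token, router_weights, experts, k=2):
    """Apply a toy MoE layer to one token vector; returns (output, chosen experts)."""
    # Router: one linear score per expert (dot product with the token).
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    chosen, gates = route_top_k(logits, k)
    output = [0.0] * len(token)
    for idx, gate in zip(chosen, gates):
        expert_out = experts[idx](token)  # only the chosen experts run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Hypothetical toy setup: 8 "experts" that simply scale their input.
def make_expert(scale):
    return lambda vec: [scale * v for v in vec]

experts = [make_expert(float(s)) for s in range(8)]
router_weights = [[float(s), 0.0] for s in range(8)]  # expert s scores s * token[0]

output, chosen = moe_layer([1.0, 0.0], router_weights, experts, k=2)
print(chosen)  # indices of the 2 experts that ran for this token
```

Only the two selected experts are ever called. The other six contribute nothing to the output and cost no compute, although in a real model their weights would still occupy memory.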

Practical implications:

Property	Dense	MoE
VRAM at same total params	All weights loaded	Same (all experts loaded, only active ones used)
Inference speed	Consistent per token	Faster per token; can vary with token routing
Memory bandwidth	All params read per token	Only active params read per token
Training stability	More stable	Requires careful tuning

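Real example: DeepSeek-V2 has 236B total parameters but only 21B active per token. All 236B weights must still be held in memory (roughly 472 GB at 16-bit precision, less when quantized), so the model does not fit on a single 40 GB GPU. The saving is in compute: each token costs about as much as it would in a 21B dense model, while the model keeps the capacity of its full 236B parameters.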
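
The split between memory cost and compute cost can be made concrete with a small calculation. This sketch assumes 16-bit weights and uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, which ignores attention over the context; the function name is illustrative.

```python
def moe_profile(total_params, active_params, bytes_per_param=2):
    """Rough memory and compute profile of an MoE model."""
    return {
        # Memory is set by the total parameter count: every expert is resident.
        "weight_memory_gb": total_params * bytes_per_param / 1e9,
        # Compute is set by the parameters each token actually uses.
        "active_fraction": active_params / total_params,
        "approx_flops_per_token": 2 * active_params,  # rule-of-thumb estimate
    }

profile = moe_profile(total_params=236e9, active_params=21e9)
print(f"Weights: {profile['weight_memory_gb']:.0f} GB")
print(f"Active fraction: {profile['active_fraction']:.1%}")
print(f"Approx. FLOPs per token: {profile['approx_flops_per_token']:.1e}")
```

With the DeepSeek-V2 figures, under 9% of the parameters take part in any one token, while the memory footprint is that of the full model.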