02. DeepSeek R1 Architecture

Chapter 2 of 18 · 15 min

DeepSeek R1's architecture builds on the standard transformer stack but changes how attention state is cached and how feed-forward capacity is allocated, and its training adds a reinforcement learning stage that makes extended reasoning possible. Understanding these internals helps operators debug failures, optimize inference, and make informed deployment decisions.

Multi-Head Latent Attention (MLA)

R1 uses Multi-Head Latent Attention, a variant of multi-head attention that reduces KV cache memory through low-rank compression. In standard MHA, every head caches its own key and value vectors for each token. MLA instead caches a compact latent vector that all heads share and expands it back into per-head keys and values at attention time, with the aim of preserving quality. The memory savings are substantial: in the illustrative configuration below, the cache shrinks by roughly 94% relative to standard MHA.

# Conceptual comparison of attention memory footprint (illustrative sizes, 4-byte fp32 values)
# Standard MHA: 12 layers, 64 heads, 128 head_dim
# Memory per token = 12 layers * 64 heads * 128 dims * 2 (K and V) * 4 bytes = 786,432 bytes ≈ 786 KB

# MLA with latent_dim = 512
# Memory per token = 12 layers * 512 dims * 2 * 4 bytes = 49,152 bytes ≈ 49 KB (≈94% reduction)

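This memory reduction feeds directly into capacity planning: a smaller per-token cache lets more concurrent sequences, or longer contexts, fit in the same GPU memory, which raises the batch sizes and throughput you can plan for.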
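To make the capacity effect concrete, here is a minimal sketch that reproduces the arithmetic from the comparison above and turns it into a token budget. The helper names and the 10 GB cache budget are hypothetical, and the layer and head counts are the illustrative ones from the comparison, not R1's real configuration.

```python
def kv_cache_bytes_per_token(layers, per_layer_elements, bytes_per_element=4):
    """Bytes of cache that one token occupies across all layers."""
    return layers * per_layer_elements * bytes_per_element


def max_cached_tokens(budget_bytes, bytes_per_token):
    """How many tokens of cache fit in a given memory budget."""
    return budget_bytes // bytes_per_token


# Illustrative sizes from the comparison above (assumed, not R1's real config).
LAYERS, HEADS, HEAD_DIM, LATENT_DIM = 12, 64, 128, 512

mha = kv_cache_bytes_per_token(LAYERS, HEADS * HEAD_DIM * 2)  # keys + values per head
mla = kv_cache_bytes_per_token(LAYERS, LATENT_DIM * 2)        # same factor of 2 as the comparison

budget = 10 * 10**9  # hypothetical 10 GB reserved for the cache

print(mha, mla)                           # 786432 49152
print(f"reduction: {1 - mla / mha:.0%}")  # reduction: 94%
print(max_cached_tokens(budget, mha), max_cached_tokens(budget, mla))
```

Swapping in your own layer count, cache width, and element size gives a first-order estimate of how many tokens of context you can hold across all concurrent requests.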

DeepSeek MoE Architecture

R1 employs a Mixture of Experts architecture with 256 routed experts per layer, activating 8 per token. Routing is load-balanced during training so that no single expert becomes a bottleneck; the DeepSeek-V3 base model that R1 builds on reports achieving this mainly by adjusting per-expert routing biases rather than by relying on a large auxiliary loss. The goal is quality comparable to a much larger dense model while only a small fraction of the parameters is active for any given token.
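The routing step can be sketched in a few lines. This is a generic top-k gate, not R1's actual router: the function name is hypothetical, and the softmax over the selected logits is an assumption made for illustration (the real gating function and its normalization may differ).

```python
import math


def route_token(router_logits, top_k=8):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    The softmax gate is an illustrative assumption, not R1's confirmed gate.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router scores for 256 experts; only 8 are selected.
logits = [math.sin(i) for i in range(256)]
picks = route_token(logits)
print(len(picks), round(sum(w for _, w in picks), 6))
```

The token's output is then the weighted sum of the selected experts' outputs, which is why the other 248 experts cost no compute for that token.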

The math: R1 has 671B total parameters but only ~37B active parameters per token, about 5.5% of the total. During inference, each token is computed only through its selected experts, so FLOPs per token scale with the active count rather than the total parameter count.
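A quick calculation separates the two quantities that matter for deployment. Compute follows the active parameters, but every expert still has to be resident in memory because any of them can be selected for the next token. The bytes-per-parameter values below are assumptions for illustration; use the precision you actually serve with.

```python
TOTAL_PARAMS = 671e9   # total parameters, from the text
ACTIVE_PARAMS = 37e9   # approximate active parameters per token, from the text


def active_fraction(active=ACTIVE_PARAMS, total=TOTAL_PARAMS):
    """Share of the model's parameters used to compute one token."""
    return active / total


def weight_memory_gb(params, bytes_per_param):
    """Memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"{active_fraction():.1%}")          # 5.5%
print(weight_memory_gb(TOTAL_PARAMS, 1))   # 671.0, assuming 1 byte per parameter
print(weight_memory_gb(TOTAL_PARAMS, 2))   # 1342.0, assuming 2 bytes per parameter
```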

Reinforcement Learning Training for Reasoning

Unlike models trained purely on next-token prediction, R1 underwent RL training using Group Relative Policy Optimization (GRPO). For each prompt, GRPO samples a group of responses, scores each one with reward signals (correctness, format, helpfulness), and updates the policy toward responses that score above the group average and away from those that score below it. Because the baseline comes from the group itself, no separate value model is needed.
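The "relative" part of GRPO can be sketched as a normalization over one prompt's group of rewards. This shows only the advantage computation, not the full policy update; the function name is hypothetical, and using the population standard deviation with a small epsilon is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses for the same prompt.

    Positive values mean "better than the group average", negative means worse.
    eps avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

When every response in a group earns the same reward, all advantages are zero, so that prompt contributes no learning signal for the step.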

The RL training is what produces the explicit chain-of-thought behavior. The model learns to verbalize intermediate steps because doing so improves reward outcomes. It is not just mimicking human reasoning examples; it is optimizing for explicit reasoning as a strategy.

EXERCISE

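Starting from R1's active parameter count (~37B), calculate the compute for a single forward pass for one token; a common rule of thumb is about 2 FLOPs per active parameter. Given A100 throughput of ~300 TFLOP/s, estimate the maximum tokens per second possible, treating the result as a compute-bound upper limit. Compare this to your latency requirements.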
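One way to set up the estimate is sketched below. It assumes the rule of thumb of roughly 2 FLOPs per active parameter per token, and it ignores attention cost, memory bandwidth, and multi-GPU communication, so the result is an optimistic ceiling rather than a prediction. The function names and the `utilization` parameter are hypothetical.

```python
def flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params


def max_tokens_per_second(peak_flops, active_params, utilization=1.0):
    """Compute-bound ceiling on tokens per second at a given hardware utilization."""
    return peak_flops * utilization / flops_per_token(active_params)


bound = max_tokens_per_second(300e12, 37e9)
print(round(bound))  # 4054

# Hypothetical 40% utilization, to see how quickly the ceiling drops.
print(round(max_tokens_per_second(300e12, 37e9, utilization=0.4)))
```

Real decoding throughput is usually well below this ceiling, so measure on your own hardware before committing to a latency target.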