06. Hardware Requirements

Chapter 6 of 18 · 20 min

Deploying R1 requires careful hardware planning. The model's size, MoE architecture, and reasoning token generation create specific resource demands that differ from standard LLM serving.

Memory Requirements

R1's memory footprint breaks into three components:

  • Model weights: ~1.34TB (FP16) for the full 671B model; ~671GB (INT8); ~336GB (INT4)
  • KV cache: Variable based on sequence length and batch size; MLA reduces this significantly
  • Activation memory: Proportional to batch size and sequence length
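The weight figures follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, where `weight_footprint_gb` is a hypothetical helper name and sizes are in decimal gigabytes (1 GB = 1e9 bytes), ignoring quantization metadata such as scales and zero points:

```python
def weight_footprint_gb(total_params: float, bits_per_param: int) -> float:
    """Weight storage in decimal gigabytes for a given precision."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 671e9  # full R1 parameter count

# Footprint at each precision listed above
footprints = {
    name: weight_footprint_gb(TOTAL_PARAMS, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in footprints.items():
    print(f"{name}: {gb:,.1f} GB")
```

Real checkpoints run somewhat larger than these figures because quantized formats store extra per-group metadata.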

Even with INT4 quantization, the full model does not fit on a single 80GB GPU. The budget works out as follows:

# Memory budget for single GPU serving
model_bits = 4  # INT4 quantization
total_params = 671e9  # 671B parameters
model_bytes = (model_bits / 8) * total_params  # ~336 GB for INT4

# This exceeds single A100/H100 VRAM (80GB), so the full model
# needs tensor parallelism across multiple GPUs

# A distilled model such as R1-Distill-Qwen (32B variant) is the single GPU option
distill_params = 32e9  # 32B parameters (dense model, so all are active)
distill_bytes = (4 / 8) * distill_params  # ~16 GB
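To turn a weight budget into a GPU count, divide by the VRAM you can actually dedicate to weights. The sketch below is illustrative only: `min_gpus_for_weights` is a hypothetical helper, and the 70% usable fraction is an assumption that leaves room for KV cache and activations, not a measured value.

```python
import math


def min_gpus_for_weights(weight_bytes: float,
                         vram_bytes: float = 80e9,
                         usable_fraction: float = 0.7) -> int:
    """Smallest GPU count whose weight-dedicated VRAM holds the model.

    usable_fraction is an assumed share of VRAM reserved for weights;
    the remainder is left for KV cache and activations.
    """
    usable_per_gpu = vram_bytes * usable_fraction
    return math.ceil(weight_bytes / usable_per_gpu)


full_int4_bytes = (4 / 8) * 671e9    # full R1 at INT4
distill_int4_bytes = (4 / 8) * 32e9  # 32B distilled model at INT4

gpus_full = min_gpus_for_weights(full_int4_bytes)
gpus_distill = min_gpus_for_weights(distill_int4_bytes)
print(gpus_full, gpus_distill)
```

Tensor-parallel degrees generally need to divide the attention head count evenly, so a raw count like this is usually rounded up to the next power of two in practice.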

Compute Requirements

Per-token compute in an MoE model scales with active parameters rather than total parameters. R1 activates 37B parameters per token, so each token's forward pass costs roughly 74 GFLOPs (assuming 2 FLOPs per parameter per token).

# Theoretical compute-bound throughput
a100_flops = 312e12  # 312 TFLOPS peak FP16 for A100
flops_per_token = 2 * 37e9  # ~74 GFLOPs per token
tokens_per_second = a100_flops / flops_per_token  # ~4,200 tokens/s

# With 8x tensor parallelism across A100s
effective_throughput = tokens_per_second * 8  # ~33,700 tokens/s

These are compute ceilings that assume perfect utilization. Reality is far lower: token-by-token generation at small batch sizes is limited by memory bandwidth and inter-GPU communication, not peak FLOPS. Expect 20-30 tokens/s across 8x A100 for generation.
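The memory bandwidth limit can be estimated the same way as the compute limit. The sketch below is a rough single-GPU roofline: each generated token must read the active weights from HBM once. The ~2 TB/s bandwidth figure for an 80GB A100 and the INT4 weight size are assumptions, and `decode_ceiling_tokens_per_s` is a hypothetical helper name.

```python
def decode_ceiling_tokens_per_s(active_params: float,
                                bytes_per_param: float,
                                mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate when weight reads dominate."""
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


ACTIVE_PARAMS = 37e9     # R1 active parameters per token
A100_BANDWIDTH = 2.0e12  # assumed ~2 TB/s HBM bandwidth (80GB A100)
A100_PEAK_FLOPS = 312e12 # peak FP16 tensor throughput

# Bandwidth-bound ceiling with INT4 weights (0.5 bytes per parameter)
bandwidth_bound = decode_ceiling_tokens_per_s(ACTIVE_PARAMS, 0.5, A100_BANDWIDTH)

# Compute-bound ceiling at 2 FLOPs per active parameter
compute_bound = A100_PEAK_FLOPS / (2 * ACTIVE_PARAMS)

print(f"bandwidth-bound: {bandwidth_bound:.0f} tokens/s")
print(f"compute-bound:   {compute_bound:.0f} tokens/s")
```

The bandwidth ceiling sits well below the compute ceiling, which is why single-stream decoding rarely approaches peak FLOPS. The estimate still ignores KV cache reads, expert routing, and inter-GPU communication, so measured throughput will be lower again.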

Reasoning Token Impact on Hardware

Extended reasoning chains consume compute and memory in proportion to their length. A request generating 1000 reasoning tokens followed by 200 output tokens requires compute for 1200 total tokens, six times the cost of the visible answer alone. The KV cache also grows with sequence length, so long reasoning chains hold significant GPU memory for the duration of the request.
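A quick way to see the cost of reasoning tokens is to convert token counts into generation time. This sketch assumes a steady decode rate of 25 tokens/s, the midpoint of the range quoted earlier, and ignores prompt processing time. `request_generation_seconds` is a hypothetical helper.

```python
def request_generation_seconds(reasoning_tokens: int,
                               output_tokens: int,
                               tokens_per_second: float) -> float:
    """Time to generate all tokens of a request at a steady decode rate."""
    total_tokens = reasoning_tokens + output_tokens
    return total_tokens / tokens_per_second


ASSUMED_DECODE_RATE = 25.0  # tokens/s, an assumed steady-state rate

with_reasoning = request_generation_seconds(1000, 200, ASSUMED_DECODE_RATE)
answer_only = request_generation_seconds(0, 200, ASSUMED_DECODE_RATE)

print(f"with reasoning: {with_reasoning:.0f} s")
print(f"answer only:    {answer_only:.0f} s")
```

Under these assumptions most of the wall-clock time goes to tokens the user never sees, which is worth factoring into latency targets.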

# KV cache sizing for reasoning workloads
seq_length = 2048  # 1000 reasoning + 1000 context + 48 output
layers = 61
heads = 128
head_dim = 128
bytes_per_element = 2  # FP16

kv_cache_per_request = (
    2 *  # keys and values
    seq_length *
    layers *
    heads *
    head_dim *
    bytes_per_element
)
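# ~8.2 GB per request in KV cache alone
# (standard multi-head attention formula; MLA caches far less)

At roughly 8.2GB per request, 80GB of VRAM holds approximately 9 concurrent requests' worth of KV cache, before accounting for model weights.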
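The formula above stores full keys and values for every head. With MLA, each token instead caches one compressed latent vector per layer plus a small positional component. The sketch below compares the two. The MLA dimensions (`kv_lora_rank=512`, `rope_dim=64`) are assumptions based on the published DeepSeek-V3 configuration and should be checked against the config of the checkpoint you deploy. Both function names are hypothetical.

```python
def mha_kv_cache_bytes(seq_length: int, layers: int = 61, heads: int = 128,
                       head_dim: int = 128, bytes_per_element: int = 2) -> int:
    """KV cache size with full per-head keys and values (no compression)."""
    return 2 * seq_length * layers * heads * head_dim * bytes_per_element


def mla_kv_cache_bytes(seq_length: int, layers: int = 61,
                       kv_lora_rank: int = 512, rope_dim: int = 64,
                       bytes_per_element: int = 2) -> int:
    """KV cache size when only the compressed latent and RoPE key are stored.

    kv_lora_rank and rope_dim are assumed values, not confirmed by this chapter.
    """
    return seq_length * layers * (kv_lora_rank + rope_dim) * bytes_per_element


SEQ_LENGTH = 2048

mha_bytes = mha_kv_cache_bytes(SEQ_LENGTH)
mla_bytes = mla_kv_cache_bytes(SEQ_LENGTH)
reduction = mha_bytes / mla_bytes

print(f"MHA: {mha_bytes / 1e9:.2f} GB per request")
print(f"MLA: {mla_bytes / 1e6:.0f} MB per request")
print(f"reduction: {reduction:.1f}x")
```

If these dimensions hold, KV cache stops being the binding constraint on concurrency at this sequence length, and weights and activations dominate the budget instead.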

EXERCISE

Calculate the minimum GPU configuration needed to serve R1-Distill with <2s latency for 1000-token reasoning chains at 10 concurrent requests. Include memory and compute requirements.