07. Memory Optimization
Serving reasoning models efficiently requires aggressive memory optimization. The combination of large model weights, long reasoning chains, and concurrent requests creates memory pressure that demands systematic optimization strategies.
Quantization Strategies
The first optimization is model quantization. Beyond standard INT8/INT4, reasoning models benefit from specific quantization approaches:
# Quantization approach comparison for R1
quantization_configs = {
    "FP16": {"bits": 16, "quality": 1.0, "memory_gb": 1342},
    "INT8": {"bits": 8, "quality": 0.98, "memory_gb": 671},
    "INT4": {"bits": 4, "quality": 0.92, "memory_gb": 336},
    "GPTQ-INT4": {"bits": 4, "quality": 0.94, "memory_gb": 336},
    "AWQ-INT4": {"bits": 4, "quality": 0.95, "memory_gb": 336},
}
# For production: AWQ or GPTQ with calibration data
# Avoid naive INT4 that degrades reasoning quality
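AWQ (Activation-aware Weight Quantization) preserves more quality than naive INT4 by using activation statistics from calibration data to find the small fraction of salient weight channels and scaling them so they lose less precision. Quantization error is pushed onto the weights that matter less. For reasoning models this matters because quantization errors in attention computation propagate through long reasoning chains.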
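The memory figures in the comparison above follow from simple arithmetic: parameter count times bits per weight. The following is a minimal sketch, assuming the 671B parameter count implied by the FP16 figure (1342 GB at 2 bytes per weight) and decimal gigabytes. `weight_memory_gb` is an illustrative helper, and the estimate ignores the scales and zero-points that real quantized checkpoints store alongside the weights.

```python
import math

# Assumption: 671 billion parameters, implied by the 1342 GB FP16 figure above.
NUM_PARAMS = 671_000_000_000

def weight_memory_gb(num_params: int, bits: int) -> int:
    """Weight-only footprint in decimal GB, rounded up to a whole GB."""
    return math.ceil(num_params * bits / 8 / 1e9)

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits)} GB")
```

All 4-bit schemes land on the same footprint under this estimate, so the choice between naive INT4, GPTQ, and AWQ is about quality, not memory.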
KV Cache Optimization
KV cache memory grows linearly with both sequence length and the number of concurrent requests. R1's Multi-head Latent Attention (MLA) reduces this by ~60%, but reasoning workloads still generate long sequences.
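# KV cache management strategies
from collections import OrderedDict

class ReasoningKVCacheManager:
    def __init__(self, max_cache_mb=40960):
        self.max_cache_mb = max_cache_mb
        self.max_cache_bytes = max_cache_mb * 1024 * 1024
        self.active_caches = OrderedDict()  # request_id -> reserved bytes, oldest first
        self.total_cache_used = 0  # bytes currently reserved
        self.eviction_policy = "lru"

    def allocate_request(self, request_id, max_tokens):
        """Reserve cache proportional to expected reasoning length; return bytes reserved"""
        # Reasoning tasks need more cache than standard tasks
        estimated_tokens = int(max_tokens * 1.5)  # Reasoning buffer overhead
        cache_size = estimated_tokens * self.bytes_per_token()
        if cache_size > self.max_cache_bytes:
            raise MemoryError("request exceeds the total cache budget")
        while self.total_cache_used + cache_size > self.max_cache_bytes:
            self.evict_least_recent()
        self.active_caches[request_id] = cache_size
        self.total_cache_used += cache_size
        return self.active_caches[request_id]

    def evict_least_recent(self):
        # Drop the oldest reservation and release its bytes
        _, freed = self.active_caches.popitem(last=False)
        self.total_cache_used -= freed

    def bytes_per_token(self):
        # With MLA: ~0.5 KB per token per layer
        # Standard attention: ~1.2 KB per token per layer
        return int(0.5 * 1024 * 61)  # ~30 KB per token for 61 layers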
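To see why long reasoning chains dominate the budget, the linear growth can be put into numbers. This is a rough sketch using the per-token, per-layer figures quoted in the comments above (about 0.5 KB with MLA, about 1.2 KB with standard attention) and 61 layers. The helper `kv_cache_gib` and the workload of 8 concurrent 32K-token requests are illustrative assumptions, not measurements.

```python
NUM_LAYERS = 61              # layer count used in the comments above
MLA_KB_PER_LAYER = 0.5       # approximate figure quoted above
STANDARD_KB_PER_LAYER = 1.2  # approximate figure quoted above

def kv_cache_gib(seq_len: int, num_requests: int,
                 kb_per_token_per_layer: float,
                 num_layers: int = NUM_LAYERS) -> float:
    """Total KV cache in GiB; linear in sequence length and in concurrency."""
    bytes_per_token = kb_per_token_per_layer * 1024 * num_layers
    return seq_len * num_requests * bytes_per_token / 1024**3

# Hypothetical workload: 8 concurrent requests, 32K tokens each.
mla = kv_cache_gib(32_768, 8, MLA_KB_PER_LAYER)
std = kv_cache_gib(32_768, 8, STANDARD_KB_PER_LAYER)
print(f"MLA: {mla:.2f} GiB, standard attention: {std:.2f} GiB")
```

Doubling either the sequence length or the number of concurrent requests doubles the result, which is why admission control for reasoning workloads usually budgets on tokens rather than on request count.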
Paged Attention
For reasoning workloads with variable-length reasoning chains, paged attention enables fine-grained memory allocation. Rather than pre-allocating for the worst-case sequence length, the server allocates fixed-size physical blocks on demand and maps logical sequence positions to them through a per-sequence block table.
# Paged attention configuration for reasoning
paged_attn_config = {
    "block_size": 16,               # Tokens per block
    "max_blocks": 4096,             # Max blocks per sequence
    "num_allocator_blocks": 10240,  # Physical blocks
    "eviction_enabled": True,
}
# This allows up to 64K tokens per sequence (4096 * 16)
# while only allocating memory for actual usage
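The mapping from logical positions to physical blocks can be illustrated with a toy allocator. This is a minimal sketch, not the implementation of any particular serving engine: `PagedKVAllocator` and its methods are hypothetical names, and it tracks only block bookkeeping (no tensors, no eviction), with defaults taken from the configuration above.

```python
class PagedKVAllocator:
    """Toy allocator: maps each sequence's logical blocks to physical blocks."""

    def __init__(self, num_physical_blocks: int = 10240, block_size: int = 16,
                 max_blocks_per_seq: int = 4096):
        self.block_size = block_size
        self.max_blocks_per_seq = max_blocks_per_seq
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens written so far

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve space for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % self.block_size == 0:  # first token, or current block is full
            if len(table) >= self.max_blocks_per_seq:
                raise ValueError("sequence exceeds max_blocks")
            if not self.free_blocks:
                raise MemoryError("no free physical blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // self.block_size], pos % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator()
for _ in range(40):  # a 40-token sequence needs ceil(40 / 16) = 3 blocks
    alloc.append_token("req-1")
print(len(alloc.block_tables["req-1"]), "blocks in use")
```

A short request holds only the blocks it has filled, so memory that worst-case pre-allocation would have reserved stays available for other sequences.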
CPU Offloading
For deployments where GPU memory is insufficient, CPU offloading can extend capacity. The trade-off is latency: offloading weights to CPU RAM increases per-token latency by 10-50x.
# CPU offloading strategy
offload_config = {
    "embedding_layers": "always_gpu",
    "early_layers": "prefer_gpu",
    "expert_layers": "prefer_gpu",
    "late_layers": "cpu_if_pressure",
    "lm_head": "always_gpu",
}
# Enables serving larger models at acceptable latency
# for non-latency-critical batch reasoning
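One way to read the policy strings above is as rules that resolve to a device once the amount of free GPU memory is known. The sketch below assumes that interpretation. The function `resolve_placement`, both thresholds, and the exact meaning given to each policy are hypothetical and not taken from any specific serving framework.

```python
def resolve_placement(policy: str, gpu_free_fraction: float,
                      pressure_threshold: float = 0.10,
                      critical_threshold: float = 0.02) -> str:
    """Map a policy string to "gpu" or "cpu" given the fraction of free GPU memory.

    Assumed semantics:
      always_gpu      -> never offloaded
      prefer_gpu      -> offloaded only when free memory is critically low
      cpu_if_pressure -> offloaded once free memory drops below the pressure threshold
    """
    if policy == "always_gpu":
        return "gpu"
    if policy == "prefer_gpu":
        return "cpu" if gpu_free_fraction < critical_threshold else "gpu"
    if policy == "cpu_if_pressure":
        return "cpu" if gpu_free_fraction < pressure_threshold else "gpu"
    raise ValueError(f"unknown policy: {policy}")


offload_config = {
    "embedding_layers": "always_gpu",
    "early_layers": "prefer_gpu",
    "expert_layers": "prefer_gpu",
    "late_layers": "cpu_if_pressure",
    "lm_head": "always_gpu",
}

# Hypothetical snapshot: 5% of GPU memory free.
placement = {group: resolve_placement(policy, gpu_free_fraction=0.05)
             for group, policy in offload_config.items()}
print(placement)
```

Under these assumed thresholds, moderate pressure moves only the late layers to CPU, and the embedding layers and LM head stay on GPU at any pressure level.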
Profile the memory usage of a reasoning request served by R1. Identify the bottleneck (weights, KV cache, or activations) and propose a specific optimization that addresses it.