Capacity Planning — Enterprise-Scale RAG (Chapter 18)

Capacity planning predicts infrastructure requirements based on expected load, data growth, and performance targets. Under-provisioning causes latency spikes; over-provisioning wastes budget.

The planning process starts with load characterization:

from dataclasses import dataclass
from collections import defaultdict
import numpy as np

@dataclass
class QueryProfile:
    avg_latency_ms: float
    p99_latency_ms: float
    queries_per_minute: int
    avg_result_size_mb: float
    cache_hit_rate: float

@dataclass
class DocumentProfile:
    total_documents: int
    avg_chunk_size_tokens: int
    chunks_per_document: int
    embedding_dimension: int
    daily_new_documents: int

def calculate_vector_index_size(profile: DocumentProfile) -> dict:
    """Calculate memory and storage requirements for vector index"""
    total_chunks = profile.total_documents * profile.chunks_per_document
    
    # Float32 embedding: 4 bytes per dimension
    embedding_bytes_per_chunk = profile.embedding_dimension * 4
    
    # Metadata overhead (JSON payload, IDs): ~2x embedding size,
    # so vector plus metadata comes to ~3x the raw embedding bytes
    metadata_multiplier = 3.0
    
    # Index overhead (HNSW graph structure): ~1.5x raw embeddings
    index_overhead = 1.5
    
    raw_size_bytes = total_chunks * embedding_bytes_per_chunk * metadata_multiplier
    indexed_size_bytes = raw_size_bytes * index_overhead
    
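    return {
        # Despite the key name, this figure includes metadata (before index overhead)
        "raw_embeddings_gb": raw_size_bytes / (1024**3),
        "indexed_size_gb": indexed_size_bytes / (1024**3),
        # HNSW is assumed to be held fully in RAM, so memory equals indexed size
        "in_memory_size_gb": indexed_size_bytes / (1024**3),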
        "total_chunks": total_chunks,
        "growth_per_day_gb": (profile.daily_new_documents * 
                              profile.chunks_per_document * 
                              embedding_bytes_per_chunk * metadata_multiplier * 
                              index_overhead) / (1024**3)
    }
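
The growth_per_day_gb figure can be turned into a planning horizon with a simple linear projection. The following is a minimal sketch, not part of the chapter's code: the helper name days_until_full and the sample numbers are hypothetical, and it assumes daily growth stays constant.

```python
import math

def days_until_full(current_gb: float, growth_per_day_gb: float, budget_gb: float) -> float:
    """Days until linear growth fills the memory budget.

    Assumption: growth is constant per day. Returns 0.0 if the budget is
    already exceeded and math.inf if the index is not growing.
    """
    if current_gb >= budget_gb:
        return 0.0
    if growth_per_day_gb <= 0:
        return math.inf
    return (budget_gb - current_gb) / growth_per_day_gb

# Hypothetical inputs: 250 GB index today, 0.5 GB/day growth, 384 GB of RAM
print(days_until_full(current_gb=250.0, growth_per_day_gb=0.5, budget_gb=384.0))  # 268.0
```

In practice the inputs would come from the indexed_size_gb and growth_per_day_gb values returned above, and the result gives the lead time available for ordering or resizing hardware.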

Throughput planning for Redis/Qdrant:

def calculate_redis_throughput_requirements(qpm: int, p99_target_ms: float) -> dict:
    """Determine Redis instance sizing for given query rate"""
    
    # HNSW search takes roughly 1-5 ms per query.
    # Redis executes commands on a single thread and can handle 100k+ simple ops/sec,
    # but vector similarity search is far more expensive: ~500-2000 ops/sec per core.
    
    ops_per_query = 2  # 1 KNN + 1 metadata fetch
    estimated_ops_per_second = (qpm / 60) * ops_per_query
    
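    # Reference points: p99 < 50 ms at 1,000 QPM needs about 3-4 cores and 8 GB+ RAM;
    # p99 < 20 ms at 10,000 QPM needs a sharded cluster.
    # Note: the rule of thumb below sizes on throughput alone (optimistic 2000 ops/sec
    # per core) and does not use p99_target_ms, so tight latency targets need extra cores.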
    
    core_count = max(2, int(np.ceil(estimated_ops_per_second / 2000)))
    ram_gb = 8 + (qpm / 1000) * 2  # Rule of thumb
    
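    return {
        "recommended_cores": core_count,
        "recommended_ram_gb": ram_gb,
        "estimated_replicas_for_ha": 2,
        # Rough estimate: unit price of 120 per GB of RAM per month, times 12 months;
        # covers a single node only (replicas are not multiplied in)
        "estimated_annual_cost": ram_gb * 120 * 12
    }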
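
Because the per-core rate for vector search is quoted as a range (500-2000 ops/sec), it can help to size against both ends rather than only the optimistic one. The sketch below is illustrative: cores_needed is a hypothetical helper that mirrors the formula above, and the 120,000 QPM load is an invented example.

```python
import math

def cores_needed(qpm: int, ops_per_core_per_sec: float,
                 ops_per_query: int = 2, min_cores: int = 2) -> int:
    """Cores required for a query rate at an assumed per-core throughput."""
    ops_per_second = (qpm / 60) * ops_per_query
    return max(min_cores, math.ceil(ops_per_second / ops_per_core_per_sec))

# Compare the pessimistic and optimistic ends of the 500-2000 ops/sec range
for rate in (500, 2000):
    print(rate, cores_needed(qpm=120_000, ops_per_core_per_sec=rate))
# 500 8
# 2000 2
```

The fourfold gap between the two results suggests measuring actual per-core throughput on representative queries before committing to a core count.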

Failure Modes:

Ignoring index rebuild overhead: A full index rebuild (after a crash or configuration change) temporarily requires 2-3x the normal working memory.
Underestimating concurrent connections: Each connection consumes memory. 10K concurrent users may translate into 500-1000 open connections.
Memory fragmentation: Redis memory fragments over time. Plan for 20-30% overhead.
Seasonal spikes: Holiday traffic may reach 10x normal load, so pre-warming or auto-scaling is required.
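
The rebuild and fragmentation overheads listed above can be folded into a single memory figure. This is a rough sketch under stated assumptions: provisioned_memory_gb is a hypothetical helper, the defaults (25% fragmentation, 2x rebuild) are picked from the ranges above, and it assumes the two overheads do not stack, so the larger one decides the provisioning.

```python
def provisioned_memory_gb(index_gb: float,
                          fragmentation: float = 0.25,
                          rebuild_multiplier: float = 2.0) -> dict:
    """Estimate memory to provision for an in-memory index.

    Assumptions: fragmentation adds a fraction on top of the index during
    steady state, and a rebuild temporarily needs rebuild_multiplier times
    the index size. The larger of the two is provisioned.
    """
    steady_state = index_gb * (1 + fragmentation)
    rebuild_peak = index_gb * rebuild_multiplier
    return {
        "steady_state_gb": steady_state,
        "rebuild_peak_gb": rebuild_peak,
        "provision_gb": max(steady_state, rebuild_peak),
    }

# Hypothetical 100 GB index
print(provisioned_memory_gb(100.0))
```

With these defaults the rebuild peak, not fragmentation, sets the requirement. Raising rebuild_multiplier to 3.0 covers the upper end of the range.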

Build headroom into capacity plans: target 70% utilization under normal load so that the remaining 30% can absorb spikes.
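
The arithmetic behind a utilization target is short enough to sketch. The helper names below are hypothetical and the load figure is invented; the only input taken from the text is the 70% target.

```python
def required_capacity(normal_load: float, target_utilization: float = 0.70) -> float:
    """Capacity to provision so that normal load sits at the target utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return normal_load / target_utilization

def spike_headroom(target_utilization: float = 0.70) -> float:
    """Largest load multiplier absorbed before utilization reaches 100%."""
    return 1 / target_utilization

# Hypothetical normal load of 700 ops/sec
print(f"{required_capacity(700):.0f}")  # 1000
print(f"{spike_headroom():.2f}x")       # 1.43x
```

A 70% target therefore absorbs spikes of roughly 1.4x normal load. Larger events, such as the 10x seasonal peaks mentioned above, still depend on pre-warming or auto-scaling.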