18. Capacity Planning
Chapter 18 of 24 · 20 min
Capacity planning predicts infrastructure requirements based on expected load, data growth, and performance targets. Under-provisioning causes latency spikes; over-provisioning wastes budget.
The planning process starts with load characterization:
from dataclasses import dataclass

import numpy as np


@dataclass
class QueryProfile:
    avg_latency_ms: float
    p99_latency_ms: float
    queries_per_minute: int
    avg_result_size_mb: float
    cache_hit_rate: float


@dataclass
class DocumentProfile:
    total_documents: int
    avg_chunk_size_tokens: int
    chunks_per_document: int
    embedding_dimension: int
    daily_new_documents: int


def calculate_vector_index_size(profile: DocumentProfile) -> dict:
    """Estimate memory and storage requirements for a vector index."""
    total_chunks = profile.total_documents * profile.chunks_per_document

    # Float32 embedding: 4 bytes per dimension
    embedding_bytes_per_chunk = profile.embedding_dimension * 4

    # Stored record = embedding (1x) + metadata such as JSON payload and IDs
    # (~2x embedding size), so ~3x the embedding in total
    metadata_multiplier = 3.0

    # Index overhead (HNSW graph structure): ~1.5x the raw stored size
    index_overhead = 1.5

    raw_size_bytes = total_chunks * embedding_bytes_per_chunk * metadata_multiplier
    indexed_size_bytes = raw_size_bytes * index_overhead

    return {
        # Despite the key name, this figure includes metadata
        "raw_embeddings_gb": raw_size_bytes / (1024**3),
        "indexed_size_gb": indexed_size_bytes / (1024**3),
        # HNSW is assumed to hold the whole index in RAM
        "in_memory_size_gb": indexed_size_bytes / (1024**3),
        "total_chunks": total_chunks,
        "growth_per_day_gb": (profile.daily_new_documents *
                              profile.chunks_per_document *
                              embedding_bytes_per_chunk * metadata_multiplier *
                              index_overhead) / (1024**3),
    }
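To make the sizing arithmetic concrete, here is a minimal sketch that reuses the multipliers from this section (3.0 for embedding plus metadata, 1.5 for HNSW overhead) and projects the month-end index size under linear growth. The corpus figures (1M documents, 8 chunks each, 768 dimensions, 10K new documents per day), the 30-day month, and the helper names `indexed_bytes` and `project_index_size_gb` are illustrative assumptions, not values from the chapter.

```python
from typing import List

BYTES_PER_FLOAT32 = 4
METADATA_MULTIPLIER = 3.0  # from this section: embedding + ~2x metadata
INDEX_OVERHEAD = 1.5       # from this section: HNSW graph overhead
GIB = 1024 ** 3


def indexed_bytes(documents: int, chunks_per_document: int, dimension: int) -> float:
    """Estimated indexed size in bytes for the given number of documents."""
    chunks = documents * chunks_per_document
    return (chunks * dimension * BYTES_PER_FLOAT32
            * METADATA_MULTIPLIER * INDEX_OVERHEAD)


def project_index_size_gb(total_documents: int,
                          daily_new_documents: int,
                          chunks_per_document: int,
                          dimension: int,
                          months: int,
                          days_per_month: int = 30) -> List[float]:
    """Indexed size in GiB at the end of each month, assuming linear growth."""
    sizes = []
    for month in range(1, months + 1):
        documents = total_documents + daily_new_documents * days_per_month * month
        sizes.append(indexed_bytes(documents, chunks_per_document, dimension) / GIB)
    return sizes


# Hypothetical corpus, chosen only for illustration
today_gb = indexed_bytes(1_000_000, 8, 768) / GIB
projection = project_index_size_gb(1_000_000, 10_000, 8, 768, months=12)

print(f"today: {today_gb:.1f} GiB")
for month, size_gb in enumerate(projection, start=1):
    print(f"month {month:2d}: {size_gb:.1f} GiB")
```

Linear growth is the simplest assumption. If ingestion is accelerating, replace the per-month document count with a forecast of your own.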
Throughput planning for Redis/Qdrant:
def calculate_redis_throughput_requirements(qpm: int, p99_target_ms: float) -> dict:
    """Estimate Redis instance sizing for a given query rate.

    Note: p99_target_ms is accepted for context but is not used in the
    formula below; the latency figures in the comments are rules of thumb.
    """
    # HNSW search takes ~1-5ms per query
    # Single-threaded Redis handles 100k+ simple ops/sec,
    # but vector similarity is expensive: ~500-2000 ops/sec per core
    ops_per_query = 2  # 1 KNN + 1 metadata fetch
    estimated_ops_per_second = (qpm / 60) * ops_per_query

    # Rule-of-thumb reference points (not encoded in the formula):
    #   p99 < 50ms at 1000 QPM: 3-4 cores, 8GB+ RAM
    #   p99 < 20ms at 10000 QPM: cluster with sharding
    # The divisor of 2000 is the optimistic end of the per-core range above
    core_count = max(2, int(np.ceil(estimated_ops_per_second / 2000)))
    ram_gb = 8 + (qpm / 1000) * 2  # Rule of thumb

    return {
        "recommended_cores": core_count,
        "recommended_ram_gb": ram_gb,
        "estimated_replicas_for_ha": 2,
        # Rough estimate: 120 per GB of RAM per month x 12 months
        "estimated_annual_cost": ram_gb * 120 * 12,
    }
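The core estimate is sensitive to the assumed per-core throughput, which the comments above place anywhere between 500 and 2000 vector ops/sec. The following sketch recomputes the core count across that range. The function name `cores_needed`, its parameters, and the 120,000 QPM load are hypothetical and chosen only to show the spread.

```python
import math


def cores_needed(qpm: int,
                 ops_per_query: int = 2,
                 ops_per_core_per_sec: float = 2000.0,
                 min_cores: int = 2) -> int:
    """Cores required to serve qpm queries per minute at a given per-core rate."""
    ops_per_second = (qpm / 60) * ops_per_query
    return max(min_cores, math.ceil(ops_per_second / ops_per_core_per_sec))


# Hypothetical load: 120,000 QPM = 2,000 queries/sec = 4,000 ops/sec
for rate in (500, 1000, 2000):
    print(f"{rate} ops/sec/core -> {cores_needed(120_000, ops_per_core_per_sec=rate)} cores")
```

At this load the pessimistic end of the range asks for four times as many cores as the optimistic end, so it is worth benchmarking the per-core rate on your own data before committing to a size. At modest loads such as 10,000 QPM, the two-core minimum dominates regardless of the rate chosen.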
Failure Modes:
- Ignoring index rebuild overhead: Full index rebuild (after crash or config change) requires 2-3x working memory temporarily.
- Underestimating concurrent connections: Each connection consumes memory. 10K concurrent users may need 500-1000 connections.
- Memory fragmentation: Over time, Redis memory becomes fragmented. Plan for 20-30% overhead.
- Seasonal spikes: Holiday traffic may be 10x normal. Pre-warming or auto-scaling required.
Build headroom into capacity plans: target 70% utilization under normal load to absorb spikes.
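The headroom rules above can be combined into a single provisioning estimate. This is a minimal sketch under stated assumptions: fragmentation overhead and the 70% utilization target are applied multiplicatively, the rebuild peak is treated as a separate temporary requirement, and the larger of the two is provisioned. The function name `provisioned_ram_gb`, its defaults (25% fragmentation, 2x rebuild), and the 100 GB working set are hypothetical.

```python
def provisioned_ram_gb(working_set_gb: float,
                       fragmentation_overhead: float = 0.25,
                       target_utilization: float = 0.70,
                       rebuild_multiplier: float = 2.0) -> dict:
    """Estimate the RAM to provision for a given in-memory working set."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    # Steady state: working set plus fragmentation, kept at the target utilization
    steady_state = working_set_gb * (1 + fragmentation_overhead) / target_utilization

    # Temporary peak during a full index rebuild (2-3x working memory)
    rebuild_peak = working_set_gb * rebuild_multiplier

    return {
        "steady_state_gb": steady_state,
        "rebuild_peak_gb": rebuild_peak,
        "provision_gb": max(steady_state, rebuild_peak),
    }


# Hypothetical 100 GB working set
plan = provisioned_ram_gb(100.0)
for name, value in plan.items():
    print(f"{name}: {value:.1f}")
```

With these defaults the rebuild peak, not steady-state headroom, sets the provisioning requirement. If rebuilds can run on a separate temporary node, the steady-state figure may be the more relevant one.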
EXERCISE
Given 5M documents with 10 chunks each at 1536-dim embeddings, calculate required RAM and project growth for 12 months at 50K new documents daily.