02. Distributed Architecture
A production RAG system should not run on a single server. At minimum, you need independent scaling for embedding, storage, retrieval, and inference, because each component has a different resource profile: embedding is CPU-bound during ingestion, retrieval is memory-bound, and inference is GPU-bound with unpredictable batch sizes.
The foundational pattern is a tiered storage architecture. Hot storage (Redis, Memcached) holds frequently accessed embeddings and recent context caches. Warm storage (SSD-backed vector DB) handles standard retrieval. Cold storage (object storage) preserves original documents and archived chunks for compliance.
# Typical tiered deployment structure
services:
  embedding-service:
    replicas: 4
    resources:
      cpu: "4"
      memory: "8Gi"
  vector-store:
    replicas: 3
    resources:
      cpu: "2"
      memory: "32Gi"
    volumes:
      - warm-data:/data
  cache-layer:
    replicas: 2
    resources:
      cpu: "1"
      memory: "16Gi"
The cache layer is not optional. Without it, repeated queries for popular documents hammer your vector database. A FAQ query that arrives 10,000 times per hour triggers 10,000 vector searches per hour without a cache, or one search per hour with a 3,600-second TTL.
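The effect of the TTL can be checked with a small simulation. This is a minimal sketch, not a production cache: `TTLCache` and `simulate_hour` are hypothetical names, the dictionary stands in for Redis or Memcached, and the injected clock exists only so the hour can be replayed instantly.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Tiny in-process TTL cache; a stand-in for Redis or Memcached."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        # key -> (expiry timestamp, cached search results)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get_or_compute(self, key: str,
                       compute: Callable[[], List[str]]) -> List[str]:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # hit: the vector store is not touched
        value = compute()            # miss or expired: run the real search
        self._entries[key] = (now + self.ttl, value)
        return value


def simulate_hour(queries_per_hour: int, ttl_seconds: float) -> int:
    """Count vector searches when one query repeats evenly over an hour."""
    fake_now = [0.0]
    searches = [0]

    def vector_search() -> List[str]:
        searches[0] += 1
        return ["doc-1", "doc-2"]    # placeholder results

    cache = TTLCache(ttl_seconds, clock=lambda: fake_now[0])
    for i in range(queries_per_hour):
        fake_now[0] = i * 3600.0 / queries_per_hour
        cache.get_or_compute("faq: reset password", vector_search)
    return searches[0]


print(simulate_hour(10_000, 3600))   # 1
```

The count is per cache key and per cache instance. If the two cache replicas do not share state, each one misses independently, so the vector store sees up to one search per replica per TTL window.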
Consistency becomes complicated. When you update a document, the vector store must reflect the change within your freshness SLA. But most vector databases do not support transactions across indexes. Updating "document version 3" requires deleting the old embeddings and inserting new ones, and between those two steps queries return either stale results (if you insert first) or missing results (if you delete first).
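One common way to narrow that window is to write the new version alongside the old one, switch readers over in a single step, and only then delete the old embeddings. The sketch below illustrates the ordering with an in-memory stand-in. `VersionedStore` and its methods are hypothetical; in a real vector database the "active version" pointer would typically be a metadata filter or a collection alias, which is an assumption about your store rather than a feature every product offers.

```python
from typing import Dict, List, Tuple


class VersionedStore:
    """In-memory stand-in for a vector store with per-document versions."""

    def __init__(self) -> None:
        # (doc_id, version) -> chunk embeddings for that version
        self._chunks: Dict[Tuple[str, int], List[List[float]]] = {}
        # doc_id -> the version queries are allowed to see
        self._active: Dict[str, int] = {}

    def publish(self, doc_id: str, version: int,
                embeddings: List[List[float]]) -> None:
        # 1. Write the new version next to the old one.
        #    Queries still resolve to the old version at this point.
        self._chunks[(doc_id, version)] = embeddings
        # 2. Flip the pointer in one assignment.
        previous = self._active.get(doc_id)
        self._active[doc_id] = version
        # 3. Only now drop the superseded embeddings.
        if previous is not None and previous != version:
            self._chunks.pop((doc_id, previous), None)

    def visible_chunks(self, doc_id: str) -> List[List[float]]:
        version = self._active.get(doc_id)
        if version is None:
            return []
        return self._chunks[(doc_id, version)]


store = VersionedStore()
store.publish("handbook", 2, [[0.1, 0.2]])
store.publish("handbook", 3, [[0.3, 0.4], [0.5, 0.6]])
print(len(store.visible_chunks("handbook")))   # 2
```

The trade-off is temporary double storage for the document being updated, in exchange for readers never seeing a half-written version.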
Partitioning strategies matter enormously. Sharding embeddings by document type, tenant, or date range enables parallel retrieval but complicates cross-partition queries. Tenant isolation requires sharding by organization—your vector database needs to know which embeddings belong to which tenant before searching.
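As an illustration of tenant-scoped routing, the sketch below maps a tenant to a shard and attaches a tenant filter to every query. The function names and the request shape are hypothetical and not tied to any particular vector database. It uses `hashlib` rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process and would route the same tenant differently on different workers.

```python
import hashlib
from typing import Dict, Sequence


def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def scoped_query(tenant_id: str, query_vector: Sequence[float],
                 num_shards: int) -> Dict[str, object]:
    """Build a request that targets one shard and carries a tenant filter."""
    return {
        "shard": shard_for_tenant(tenant_id, num_shards),
        # The filter is kept even though the shard is tenant-derived,
        # because several tenants can hash to the same shard.
        "filter": {"tenant_id": tenant_id},
        "vector": list(query_vector),
    }


request = scoped_query("acme-corp", [0.1, 0.2, 0.3], num_shards=16)
print(request["shard"], request["filter"])
```

Plain modulo hashing reassigns most tenants when `num_shards` changes, so a fixed shard count or an explicit tenant-to-shard lookup table is usually easier to operate than resizing in place.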
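The network topology creates hidden latency. If your embedding service runs in us-east-1 and your vector database runs in eu-west-1, every ingest operation crosses the Atlantic. At 100,000 chunks per day, and assuming a round trip of roughly 80 ms with one unbatched write per chunk, cross-region latency adds about 2 hours of cumulative delay per day. The total grows linearly with chunk volume and with the number of round trips per chunk.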
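The cumulative cost of cross-region writes depends on three numbers: chunk volume, round-trip time, and batch size. The helper below is a back-of-envelope sketch; the 80 ms round trip is an assumed figure for a transatlantic link, not a measurement, and `cumulative_delay_hours` is a hypothetical name.

```python
def cumulative_delay_hours(chunks_per_day: int, rtt_seconds: float,
                           batch_size: int = 1) -> float:
    """Hours per day spent waiting on cross-region round trips."""
    round_trips = -(-chunks_per_day // batch_size)   # ceiling division
    return round_trips * rtt_seconds / 3600.0


# One round trip per chunk at an assumed 80 ms RTT.
print(round(cumulative_delay_hours(100_000, 0.08), 2))                  # 2.22
# The same volume written in batches of 100 chunks.
print(round(cumulative_delay_hours(100_000, 0.08, batch_size=100), 3))  # 0.022
```

Batching cuts the delay by roughly the batch size, which is often a cheaper fix than relocating a service. Co-locating the embedding service and the vector store removes the cross-region hop entirely.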
Failure modes include split-brain scenarios where cache and vector store diverge, cascading timeouts when one service slows down, and partition events that isolate entire document collections from queries.
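Cascading timeouts are commonly contained with a circuit breaker: after a few consecutive failures, callers stop waiting on the slow service and fall back to something cheaper, such as a cached answer. The class below is a minimal single-threaded sketch of that idea. `CircuitBreaker` and its parameters are hypothetical, and a production version would also need locking and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency after repeated errors."""

    def __init__(self, max_failures: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()        # open: fail fast, skip the dependency
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                # success closes the breaker
        return result


def slow_vector_search() -> str:
    raise TimeoutError("vector store did not answer in time")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    print(breaker.call(slow_vector_search, lambda: "cached answer"))
```

In the loop above, the third call never reaches `slow_vector_search`: the breaker is already open, so the caller gets the fallback immediately instead of holding a connection while it waits for another timeout.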
Design the shard key for a multi-tenant RAG system with these requirements: (1) 100 tenants, (2) each tenant queries only their own documents, (3) queries must complete in under 500ms at p99, (4) vector index must support 10M embeddings per tenant.