Load Testing — Enterprise-Scale RAG (Chapter 20)

Load testing validates system behavior under realistic concurrent load, revealing bottlenecks and failure modes that don't appear in single-request benchmarks.

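Locust is a Python-based load testing framework. The locustfile below defines two user classes: RAGLoadUser, which runs a search followed by a generate call, and AdminLoadUser, which ingests documents at a lower rate: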

from locust import HttpUser, task, between
import json

class RAGLoadUser(HttpUser):
    wait_time = between(0.5, 2.0)  # User think time
    weight = 10  # 10x more common than admin tasks
    
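    def on_start(self):
        # Load test queries from a JSONL fixture (one JSON value per line),
        # skipping blank lines that would otherwise fail to parse
        with open("/test/fixtures/queries.jsonl") as f:
            self.queries = [json.loads(line) for line in f if line.strip()]
        self.query_idx = 0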
    
    @task
    def search_and_ask(self):
        query = self.queries[self.query_idx % len(self.queries)]
        self.query_idx += 1
        
        # Search for context
        with self.client.post(
            "/api/v1/search",
            json={"query": query, "limit": 5},
            catch_response=True
        ) as search_resp:
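            if search_resp.status_code == 200:
                # Validate the payload before marking success, so a malformed
                # 200 response is recorded as a failure rather than raising
                try:
                    context_ids = [r["id"] for r in search_resp.json()["results"]]
                except (ValueError, KeyError, TypeError) as exc:
                    search_resp.failure(f"Malformed search response: {exc}")
                    return
                search_resp.success()
                
                # Generate response using the retrieved context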
                with self.client.post(
                    "/api/v1/generate",
                    json={
                        "query": query,
                        "context_ids": context_ids
                    },
                    catch_response=True
                ) as gen_resp:
                    if gen_resp.status_code == 200:
                        gen_resp.success()
                    else:
                        gen_resp.failure(f"Generation failed: {gen_resp.status_code}")
            else:
                search_resp.failure(f"Search failed: {search_resp.status_code}")

class AdminLoadUser(HttpUser):
    wait_time = between(5.0, 15.0)  # Less frequent
    weight = 1
    
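    def on_start(self):
        # Read the sample document once per user instead of on every task run
        with open("/test/fixtures/sample_doc.json") as f:
            self.doc = json.load(f)
    
    @task
    def ingest_document(self):
        with self.client.post(
            "/api/v1/ingest",
            json=self.doc,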
            catch_response=True
        ) as resp:
            if resp.status_code in (200, 201):
                resp.success()
            else:
                resp.failure(f"Ingest failed: {resp.status_code}")
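
The two weight values control how Locust splits simulated users between the classes. As a rough check on what the locustfile above will generate, the following sketch estimates the user split and a ceiling on task rate from the weights and wait times. The UserClass and estimate_mix names are hypothetical helpers, not part of Locust, and the ceiling assumes zero response time, so real throughput will be lower.

```python
from dataclasses import dataclass


@dataclass
class UserClass:
    name: str
    weight: int
    wait_min: float  # seconds, lower bound passed to between()
    wait_max: float  # seconds, upper bound passed to between()


def estimate_mix(total_users: int, classes: list[UserClass]) -> dict[str, dict[str, float]]:
    """Approximate per-class user counts and task-rate ceilings."""
    total_weight = sum(c.weight for c in classes)
    result = {}
    for c in classes:
        users = total_users * c.weight / total_weight
        mean_wait = (c.wait_min + c.wait_max) / 2
        # Ceiling only: assumes each user starts a new task every mean_wait
        # seconds, i.e. the server responds instantly
        result[c.name] = {"users": users, "max_tasks_per_s": users / mean_wait}
    return result


mix = estimate_mix(1000, [
    UserClass("RAGLoadUser", 10, 0.5, 2.0),
    UserClass("AdminLoadUser", 1, 5.0, 15.0),
])
for name, stats in mix.items():
    print(f"{name}: ~{stats['users']:.0f} users, <= {stats['max_tasks_per_s']:.0f} tasks/s")
```

With the values from the locustfile, this gives roughly 909 RAG users at up to 727 tasks/s and 91 admin users at up to 9 tasks/s. Each search_and_ask task issues two requests, so the request-rate ceiling is about double the task rate.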

Run the load test:

locust -f locustfile.py \
  --host=https://rag-prod.internal \
  --users=1000 \
  --spawn-rate=50 \
  --run-time=15m \
  --headless \
  --csv=/results/load_test_$(date +%Y%m%d_%H%M%S)
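
The --csv prefix in the command above makes Locust write a stats file next to the other result files. A minimal sketch for turning that file into per-endpoint error rates follows. It assumes the column names "Name", "Request Count", and "Failure Count" used by recent Locust releases, so check them against your own output. The sample rows are illustrative only, not real results.

```python
import csv
import io


def error_rates(stats_csv: str) -> dict[str, float]:
    """Map each row name in a Locust stats CSV to failures / requests."""
    rates = {}
    for row in csv.DictReader(io.StringIO(stats_csv)):
        requests = int(row["Request Count"])
        failures = int(row["Failure Count"])
        # Skip rows with no traffic to avoid division by zero
        if requests > 0:
            rates[row["Name"]] = failures / requests
    return rates


# Illustrative data only (assumed column layout, made-up counts)
sample = """Type,Name,Request Count,Failure Count
POST,/api/v1/search,2000,10
POST,/api/v1/generate,1990,199
,Aggregated,3990,209
"""

rates = error_rates(sample)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

In a real run, read the text from the stats file under the --csv prefix and pass it to error_rates.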

Failure Modes:

Hitting rate limits: Production LLM APIs throttle requests. Limits vary by provider and account tier (typically 500-1000 requests per minute, or RPM, per account), so size the test's concurrency against the quota that applies to your account.
Resource exhaustion during ramp-up: Sudden load spikes can exhaust connection pools or trigger out-of-memory (OOM) failures. Keep --spawn-rate low enough that load ramps up gradually.
Database connection saturation: Concurrent users multiply connection demand. 1000 users with 10 connections each means 10,000 active connections, which exceeds most database connection limits.
Ignoring error rate: Load tests that measure only latency overlook failed requests, and requests that fail fast can make latency look better than it is. Track error rate as a primary metric alongside latency.
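
The first three failure modes above can be checked with simple arithmetic before a test is launched. The sketch below is one way to do that. The function names are hypothetical, the 1000 RPM limit is the upper end of the range quoted above, and the 3.25 s cycle time is an assumption (1.25 s mean think time plus an assumed 2 s of response time), not a measurement.

```python
import math


def offered_rpm(users: int, mean_cycle_s: float, calls_per_cycle: int = 1) -> float:
    """Requests per minute when each user completes one cycle every mean_cycle_s."""
    return users * calls_per_cycle * 60 / mean_cycle_s


def max_users_for_limit(limit_rpm: float, mean_cycle_s: float, calls_per_cycle: int = 1) -> int:
    """Largest user count whose offered load stays within limit_rpm."""
    return math.floor(limit_rpm * mean_cycle_s / (60 * calls_per_cycle))


def ramp_seconds(users: int, spawn_rate: float) -> float:
    """Time to reach full load at a constant spawn rate."""
    return users / spawn_rate


def db_connections(users: int, connections_per_user: int) -> int:
    """Worst-case active connections if every user holds its full share."""
    return users * connections_per_user


# Assumed values: one generate call per cycle, 3.25 s per cycle
print(f"Offered LLM load: {offered_rpm(1000, 3.25):.0f} RPM")
print(f"Users that fit in 1000 RPM: {max_users_for_limit(1000, 3.25)}")
print(f"Ramp to 1000 users at 50/s: {ramp_seconds(1000, 50):.0f} s")
print(f"DB connections: {db_connections(1000, 10)}")
```

Under these assumptions, 1000 users offer far more than 1000 RPM of generate calls, and only about 54 users fit within that limit. A spawn rate of 50 also reaches full load in 20 seconds, which is closer to a spike than a gradual ramp.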

Meaningful load tests run for at least 15 minutes to capture steady-state behavior and warm-cache effects.
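
To separate warm-up from steady state when reading results, one simple approach is to split latency samples at a cutoff time and compare the two windows. The sketch below assumes samples are (elapsed seconds, latency in ms) pairs and uses a hypothetical 120-second warm-up cutoff. Both the cutoff and the sample values are illustrative, not measured.

```python
from statistics import median


def split_by_warmup(samples: list[tuple[float, float]], warmup_s: float):
    """Split (elapsed_s, latency_ms) samples into warm-up and steady-state latencies."""
    warm = [latency for elapsed, latency in samples if elapsed < warmup_s]
    steady = [latency for elapsed, latency in samples if elapsed >= warmup_s]
    return warm, steady


# Illustrative samples only: latency falls as caches warm
samples = [(5, 900), (30, 700), (90, 500), (200, 310), (400, 300), (800, 290)]

warm, steady = split_by_warmup(samples, warmup_s=120)
print(f"warm-up median: {median(warm)} ms")
print(f"steady-state median: {median(steady)} ms")
```

Reporting only the steady-state window keeps cold-cache latency from skewing the headline numbers, while the warm-up window remains useful for judging cold-start behavior.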