20. Load Testing
Chapter 20 of 24 · 15 min
Load testing validates system behavior under realistic concurrent load, revealing bottlenecks and failure modes that don't appear in single-request benchmarks.
Locust is a Python-based load testing framework. The following locustfile models two user types, frequent query users and occasional admin users:
    from locust import HttpUser, task, between
    import json


    class RAGLoadUser(HttpUser):
        wait_time = between(0.5, 2.0)  # User think time
        weight = 10  # 10x more common than admin tasks

        def on_start(self):
            # Load test queries from file (one JSON value per line, blank lines skipped)
            with open("/test/fixtures/queries.jsonl") as f:
                self.queries = [json.loads(line) for line in f if line.strip()]
            self.query_idx = 0

        @task
        def search_and_ask(self):
            # Cycle through the queries in order
            query = self.queries[self.query_idx % len(self.queries)]
            self.query_idx += 1

            # Search for context
            with self.client.post(
                "/api/v1/search",
                json={"query": query, "limit": 5},
                catch_response=True
            ) as search_resp:
                if search_resp.status_code == 200:
                    search_data = search_resp.json()
                    search_resp.success()

                    # Generate response
                    context_ids = [r["id"] for r in search_data["results"]]
                    with self.client.post(
                        "/api/v1/generate",
                        json={
                            "query": query,
                            "context_ids": context_ids
                        },
                        catch_response=True
                    ) as gen_resp:
                        if gen_resp.status_code == 200:
                            gen_resp.success()
                        else:
                            gen_resp.failure(f"Generation failed: {gen_resp.status_code}")
                else:
                    search_resp.failure(f"Search failed: {search_resp.status_code}")


    class AdminLoadUser(HttpUser):
        wait_time = between(5.0, 15.0)  # Less frequent
        weight = 1

        @task
        def ingest_document(self):
            with open("/test/fixtures/sample_doc.json") as f:
                doc = json.load(f)

            with self.client.post(
                "/api/v1/ingest",
                json=doc,
                catch_response=True
            ) as resp:
                if resp.status_code in (200, 201):
                    resp.success()
                else:
                    resp.failure(f"Ingest failed: {resp.status_code}")
Run the load test:
    locust -f locustfile.py \
        --host=https://rag-prod.internal \
        --users=1000 \
        --spawn-rate=50 \
        --run-time=15m \
        --headless \
        --csv=/results/load_test_$(date +%Y%m%d_%H%M%S)
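Before launching a run like this, it can help to estimate how much traffic the settings will generate. The sketch below is a rough closed-loop approximation, not part of Locust: the helper names are hypothetical, the 0.75 s task latency is an assumed figure, and the user weights are ignored for simplicity.

```python
def ramp_seconds(users: int, spawn_rate: float) -> float:
    """Time to reach the target user count at a constant spawn rate."""
    return users / spawn_rate


def estimated_rpm(users: int, mean_wait_s: float, mean_task_latency_s: float,
                  requests_per_task: int = 1) -> float:
    """Approximate steady-state requests per minute for a closed-loop test.

    Each simulated user repeats: run one task, then wait. One cycle
    therefore takes mean_task_latency_s + mean_wait_s seconds.
    """
    cycle_s = mean_wait_s + mean_task_latency_s
    tasks_per_second = users / cycle_s
    return tasks_per_second * requests_per_task * 60


if __name__ == "__main__":
    # 1000 users spawned at 50 users/second, as in the command above.
    print(ramp_seconds(1000, 50))  # 20.0

    # between(0.5, 2.0) averages 1.25 s of think time. The 0.75 s task
    # latency is an assumption; search_and_ask issues two requests per task.
    print(estimated_rpm(1000, 1.25, 0.75, requests_per_task=2))  # 60000.0
```

Under these assumptions the ramp completes in 20 seconds and the test settles at roughly 60,000 requests per minute, about half of them to the generation endpoint. Comparing that figure with upstream API limits before the run is cheaper than discovering the mismatch during it.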
Failure Modes:
- Hitting rate limits: Production LLM APIs throttle requests. Account for API limits in concurrency settings (typically 500-1000 RPM per account).
- Resource exhaustion during ramp-up: Sudden load spikes trigger connection pool exhaustion or out-of-memory (OOM) failures. Use gradual --spawn-rate increases.
- Database connection saturation: Each concurrent user can hold several connections. 1000 users with 10 connections each means 10,000 active connections, which exceeds most DB limits.
- Ignoring error rate: Load tests that measure only latency overlook failed requests, and requests that fail fast can make latency look better than it is. Track error rates as a primary metric.
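One way to make the ramp-up gradual is a staged schedule that holds each load level before stepping up. The sketch below is plain Python with a hypothetical schedule and function name. Its return value, a (users, spawn_rate) pair or None when the schedule is finished, deliberately mirrors what Locust's LoadTestShape.tick() is expected to return, so the function could be called from a custom shape class with the elapsed run time.

```python
from typing import Optional, Sequence, Tuple

# (stage end time in seconds, target users, spawn rate in users/second)
Stage = Tuple[float, int, float]

# Hypothetical schedule: each level is held long enough to observe
# connection pool and memory behaviour before the next step.
STAGES: Sequence[Stage] = (
    (120, 100, 1.0),
    (300, 250, 2.0),
    (600, 500, 5.0),
    (900, 1000, 10.0),
)


def target_at(elapsed_s: float,
              stages: Sequence[Stage] = STAGES) -> Optional[Tuple[int, float]]:
    """Return (users, spawn_rate) for the stage covering elapsed_s.

    Returns None once the schedule is finished, which signals that the
    test should stop.
    """
    for end_s, users, spawn_rate in stages:
        if elapsed_s < end_s:
            return users, spawn_rate
    return None
```

If a stage already shows pool exhaustion or rising memory, the run can be stopped there instead of pushing on to the full 1000 users.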
Meaningful load tests run for at least 15 minutes to capture steady-state behavior and warm-cache effects.
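To report steady-state numbers, the warm-up window can be excluded before computing metrics. The following is a minimal sketch, assuming per-request samples have already been collected into a list. The Sample type, the function names, and the 120-second warm-up are all illustrative choices, not Locust features.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    elapsed_s: float   # seconds since the test started
    latency_ms: float
    ok: bool           # False for failed requests


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state_summary(samples: Iterable[Sample],
                         warmup_s: float = 120.0) -> dict:
    """Summarize only the samples recorded after the warm-up window."""
    kept = [s for s in samples if s.elapsed_s >= warmup_s]
    if not kept:
        raise ValueError("no samples after the warm-up window")
    failures = sum(1 for s in kept if not s.ok)
    return {
        "requests": len(kept),
        "error_rate": failures / len(kept),
        # Failed requests stay in the latency data so they are not hidden.
        "p99_ms": percentile([s.latency_ms for s in kept], 99),
    }
```

Reporting the error rate next to p99 keeps failed requests visible even when their latencies are low.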
EXERCISE
Configure Locust with 100 concurrent users ramping up over 2 minutes. Identify the concurrent user count where p99 latency exceeds 500ms. Document the bottleneck.
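For the analysis step, a small helper can locate the point where p99 first stays above the threshold. This is a sketch under stated assumptions: the input is a chronological series of (user count, p99 in ms) pairs, which you would extract yourself, for example from the stats history CSV that Locust writes when run with --csv (verify the column names against your Locust version). The function name and the three-sample rule are illustrative.

```python
from typing import Iterable, Optional, Tuple


def first_breach(history: Iterable[Tuple[int, float]],
                 threshold_ms: float = 500.0,
                 sustained: int = 3) -> Optional[int]:
    """Return the user count at which p99 first stays above threshold_ms.

    A breach counts only if it lasts for `sustained` consecutive samples,
    which filters out isolated spikes. Returns None if no breach occurs.
    """
    run_start_users = None
    run_length = 0
    for users, p99_ms in history:
        if p99_ms > threshold_ms:
            if run_length == 0:
                run_start_users = users
            run_length += 1
            if run_length >= sustained:
                return run_start_users
        else:
            run_length = 0
    return None
```

The returned user count marks where to start looking for the bottleneck. Correlating that moment with server-side metrics such as CPU, connection pool usage, and upstream API latency is still needed to identify the cause.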