15. Load Testing

Chapter 15 of 18 · 15 min

KEY INSIGHT

Load testing reveals bottlenecks before production traffic exposes them: synthetic workloads simulate realistic traffic patterns and measure system behavior under stress.

Load tests serve multiple purposes: identifying performance regressions, establishing baseline metrics, validating capacity planning, and uncovering race conditions. Without them, performance characteristics surface the hard way, as production incidents.

`locust` provides Python-based load testing with distributed execution support. Test scripts define user behavior, wait times, and success criteria. In distributed mode, Locust spreads the simulated users across worker processes.

```python
from locust import HttpUser, task, between, events


class InferenceUser(HttpUser):
    # Each simulated user waits 1-3 seconds between tasks.
    wait_time = between(1, 3)

    def on_start(self):
        # Runs once per simulated user before its first task.
        self.headers = {
            "Authorization": "Bearer test-key",
            "Content-Type": "application/json",
        }

    @task
    def completions(self):
        payload = {
            # Read the model from the custom --model flag registered below.
            "model": self.environment.parsed_options.model,
            "messages": [{"role": "user", "content": "What is load testing?"}],
            "temperature": 0.7,
        }
        # catch_response=True lets the task decide what counts as a failure.
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=self.headers,
            catch_response=True,
        ) as response:
            if response.status_code == 200:
                data = response.json()
                # Assumes an OpenAI-style body, where the text sits under
                # choices[0].message.content rather than at the top level.
                choices = data.get("choices") or [{}]
                if "content" in choices[0].get("message", {}):
                    response.success()
                else:
                    response.failure("Missing content field")
            elif response.status_code == 503:
                # Expected under load. Marking it as success keeps
                # deliberate load shedding out of the failure statistics.
                response.success()
            else:
                response.failure(f"Unexpected status: {response.status_code}")


@events.init_command_line_parser.add_listener
def add_custom_arguments(parser):
    # Adds a --model option to the locust command line.
    parser.add_argument("--model", type=str, default="llama3.2:latest")
```

Run tests with increasing user counts to identify the saturation point, the load at which latency climbs sharply or errors begin to appear. Monitor response time percentiles (p50, p95, p99) rather than averages, because a mean hides the slow tail. A p99 latency exceeding several seconds suggests queue buildup or resource contention.

Target SLOs determine passing criteria. If the API must respond within 500 ms for 95% of requests, the test passes only when p95 latency stays at or below 500 ms at the target load. Rising failure and timeout rates indicate that the system has reached its capacity limit.
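To make the percentile and SLO reasoning concrete, here is a minimal sketch in plain Python, independent of Locust. The helper names (`percentile`, `meets_slo`, `saturation_point`) and the latency figures are invented for illustration, and the nearest-rank percentile method is one common choice among several; Locust's own reports may compute percentiles slightly differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, threshold_ms=500, pct=95):
    """True when the pct-th percentile latency is within the threshold."""
    return percentile(latencies_ms, pct) <= threshold_ms


def saturation_point(steps, p99_limit_ms):
    """Return the lowest user count whose p99 exceeds the limit, or None.

    `steps` maps a user count to the latencies (ms) observed at that load.
    """
    for users in sorted(steps):
        if percentile(steps[users], 99) > p99_limit_ms:
            return users
    return None


# Made-up latencies (ms): 94 fast requests and 6 slow ones.
latencies = [120] * 94 + [4000] * 6
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.0f}ms p50={percentile(latencies, 50)}ms "
      f"p95={percentile(latencies, 95)}ms p99={percentile(latencies, 99)}ms")
print("meets 500ms/p95 SLO:", meets_slo(latencies))
```

With these made-up numbers the mean is about 353 ms, comfortably under 500 ms, yet p95 is 4000 ms and the SLO check fails. That gap is why the mean is a poor pass criterion.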

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
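One way to capture those details is a short script that prints them alongside each run. This is a sketch only: the function name `environment_record` and the choice of fields are assumptions, and details such as the data path or GPU backend would still need to be noted by hand.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages=("locust",)):
    """Collect runtime details worth noting next to a test result."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


print(json.dumps(environment_record(), indent=2))
```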

EXERCISE

Create a load test that simulates realistic traffic patterns with varied model sizes and message lengths. Configure Locust to report response time histograms and identify the user count where p99 latency exceeds 1 second.
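As a starting point for the varied-traffic part of the exercise, the sketch below builds request payloads with a weighted mix of models and prompt lengths. Everything in it is an assumption to adapt: `small-model` and `large-model` are placeholder names, and the weights and word counts are illustrative, not measured traffic. Only `llama3.2:latest` comes from the chapter's example.

```python
import random
from collections import Counter

# Placeholder model names and traffic shares; replace with models your
# server actually hosts and a mix that matches your expected traffic.
MODEL_WEIGHTS = {
    "small-model": 0.5,
    "llama3.2:latest": 0.4,
    "large-model": 0.1,
}

# Illustrative prompt sizes in words: short, medium, long.
PROMPT_WORD_COUNTS = (8, 64, 512)


def build_payload(rng):
    """Return one chat-completions payload with a random model and prompt length."""
    model = rng.choices(
        list(MODEL_WEIGHTS), weights=list(MODEL_WEIGHTS.values()), k=1
    )[0]
    words = rng.choice(PROMPT_WORD_COUNTS)
    prompt = " ".join(["token"] * words)
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }


# A seeded generator makes the sampled mix reproducible between runs.
rng = random.Random(42)
sample = [build_payload(rng) for _ in range(1000)]
print(Counter(p["model"] for p in sample))
```

Inside a Locust task, a call such as `build_payload(random.Random())` could replace the fixed payload, so that each request draws a different model and prompt size.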