17. Performance Benchmarking
Performance benchmarking quantifies gateway behavior under controlled conditions. Without benchmarks, optimization efforts lack direction and regression goes undetected. Systematic benchmarking provides the empirical foundation for capacity planning and performance engineering.
Benchmark design requires workload characterization. Real traffic patterns inform test scenarios—request size distributions, concurrent load patterns, and model preference ratios. Synthetic benchmarks generate reproducible results but may not reflect production behavior. Hybrid approaches combine recorded traffic replay with synthetic stress testing.
import asyncio
from dataclasses import dataclass
from typing import Callable
import time
@dataclass
class BenchmarkResult:
name: str
total_requests: int
successful: int
failed: int
p50_latency_ms: float
p95_latency_ms: float
p99_latency_ms: float
throughput_rps: float
class BenchmarkRunner:
def __init__(self, gateway: GatewayClient, config: BenchmarkConfig):
self.gateway = gateway
self.config = config
self.results: list[BenchmarkResult] = []
async def run_concurrent_benchmark(
self, name: str,
workload: list[WorkloadRequest],
concurrency: int
) -> BenchmarkResult:
latencies: list[float] = []
errors: int = 0
semaphore = asyncio.Semaphore(concurrency)
async def process_request(req: WorkloadRequest):
async with semaphore:
start = time.perf_counter()
try:
await self.gateway.inference(req)
latency = (time.perf_counter() - start) * 1000
latencies.append(latency)
except Exception:
nonlocal errors
errors += 1
await asyncio.gather(*[process_request(r) for r in workload])
latencies.sort()
return BenchmarkResult(
name=name,
total_requests=len(workload),
successful=len(latencies),
failed=errors,
p50_latency_ms=latencies[len(latencies)//2] if latencies else 0,
p95_latency_ms=latencies[int(len(latencies)*0.95)] if latencies else 0,
p99_latency_ms=latencies[int(len(latencies)*0.99)] if latencies else 0,
throughput_rps=len(latencies) / self.config.duration_seconds
)
async def run_comparison(self, requests: list[WorkloadRequest],
providers: list[str]) -> dict[str, BenchmarkResult]:
results = {}
for provider in providers:
self.gateway.set_provider(provider)
result = await self.run_concurrent_benchmark(
f"{provider}_benchmark",
requests,
self.config.concurrency
)
results[provider] = result
return results
Benchmark runners should execute regularly—nightly at minimum—in CI/CD pipelines. Results compare against historical baselines; regressions trigger alerts before deployment. Long-running benchmark trends reveal performance degradation that accumulates gradually.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Create a benchmark suite that tests your gateway with 1000 requests across varying concurrency levels (1, 10, 50, 100). Generate latency distribution plots and identify the throughput ceiling.