Performance Benchmarking — Hybrid Local-Cloud AI Architecture (Chapter 17)

Performance benchmarking quantifies gateway behavior under controlled conditions. Without benchmarks, optimization efforts lack direction and regressions go undetected. Systematic benchmarking provides the empirical foundation for capacity planning and performance engineering.

Benchmark design requires workload characterization. Real traffic patterns inform the test scenarios: request size distributions, concurrent load patterns, and model preference ratios. Synthetic benchmarks generate reproducible results but may not reflect production behavior. Hybrid approaches combine recorded traffic replay with synthetic stress testing.
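The hybrid approach can be sketched as a workload builder that mixes replayed requests with synthetic ones. This is a minimal illustration under stated assumptions, not the gateway's actual tooling: `SketchRequest` is a hypothetical stand-in for the chapter's `WorkloadRequest`, the model names are placeholders, and the log-normal parameters are illustrative rather than measured from real traffic.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRequest:
    """Hypothetical stand-in for WorkloadRequest (fields are assumptions)."""
    model: str
    prompt_tokens: int
    source: str  # "recorded" or "synthetic"


def synthetic_requests(n, model_weights, rng,
                       median_tokens=200, sigma=0.8, max_tokens=4096):
    """Draw n requests: log-normal prompt sizes, weighted model choice."""
    models = list(model_weights)
    weights = [model_weights[m] for m in models]
    out = []
    for _ in range(n):
        # A log-normal gives many small prompts and a long tail of large ones.
        tokens = int(rng.lognormvariate(math.log(median_tokens), sigma))
        tokens = max(1, min(tokens, max_tokens))
        model = rng.choices(models, weights=weights, k=1)[0]
        out.append(SketchRequest(model, tokens, "synthetic"))
    return out


def build_hybrid_workload(recorded, n_synthetic, model_weights, seed=0):
    """Combine recorded requests with seeded synthetic ones, then shuffle."""
    rng = random.Random(seed)  # a fixed seed keeps the run reproducible
    workload = list(recorded) + synthetic_requests(n_synthetic, model_weights, rng)
    rng.shuffle(workload)
    return workload
```

Seeding the generator keeps the synthetic portion identical between runs, so a change in results can be attributed to the system under test rather than to the workload.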

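import asyncio
import time
from dataclasses import dataclass

# GatewayClient, BenchmarkConfig, and WorkloadRequest are assumed to be
# defined elsewhere in the gateway codebase; they are not shown in this listing.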

@dataclass
class BenchmarkResult:
    name: str
    total_requests: int
    successful: int
    failed: int
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    throughput_rps: float

class BenchmarkRunner:
    def __init__(self, gateway: GatewayClient, config: BenchmarkConfig):
        self.gateway = gateway
        self.config = config
        self.results: list[BenchmarkResult] = []
    
    async def run_concurrent_benchmark(
        self, name: str, 
        workload: list[WorkloadRequest],
        concurrency: int
    ) -> BenchmarkResult:
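        latencies: list[float] = []
        errors = 0

        # Cap the number of in-flight requests at `concurrency`.
        semaphore = asyncio.Semaphore(concurrency)

        async def process_request(req: WorkloadRequest) -> None:
            nonlocal errors
            async with semaphore:
                # Timing starts after the semaphore is acquired, so time spent
                # waiting for a free slot is excluded from the latency.
                start = time.perf_counter()
                try:
                    await self.gateway.inference(req)
                except Exception:
                    errors += 1
                else:
                    latencies.append((time.perf_counter() - start) * 1000)

        # Measure wall-clock time for the whole run; throughput is derived from
        # this rather than from a configured duration.
        run_start = time.perf_counter()
        await asyncio.gather(*(process_request(r) for r in workload))
        elapsed = time.perf_counter() - run_start

        latencies.sort()

        def percentile(p: float) -> float:
            # Simple index-based percentile over the sorted latencies.
            if not latencies:
                return 0.0
            return latencies[min(int(len(latencies) * p), len(latencies) - 1)]

        return BenchmarkResult(
            name=name,
            total_requests=len(workload),
            successful=len(latencies),
            failed=errors,
            p50_latency_ms=percentile(0.50),
            p95_latency_ms=percentile(0.95),
            p99_latency_ms=percentile(0.99),
            throughput_rps=len(latencies) / elapsed if elapsed > 0 else 0.0,
        )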
    
    async def run_comparison(self, requests: list[WorkloadRequest],
                             providers: list[str]) -> dict[str, BenchmarkResult]:
        results = {}
        for provider in providers:
            self.gateway.set_provider(provider)
            result = await self.run_concurrent_benchmark(
                f"{provider}_benchmark",
                requests,
                self.config.concurrency
            )
            results[provider] = result
        return results

Benchmark runners should execute regularly in CI/CD pipelines, nightly at minimum. Results are compared against historical baselines so that regressions trigger alerts before deployment. Long-running benchmark trends reveal performance degradation that accumulates gradually.
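A baseline comparison can be as small as the sketch below. It assumes a stored baseline with the same latency and throughput fields as `BenchmarkResult`; the `Summary` type, the `find_regressions` helper, and the 10% tolerance are hypothetical choices for illustration, not values prescribed by this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical subset of BenchmarkResult fields kept as a baseline."""
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    throughput_rps: float


def find_regressions(baseline: Summary, current: Summary,
                     tolerance: float = 0.10) -> list[str]:
    """Return one message per metric that is worse than baseline by more than tolerance."""
    problems = []
    # Latency regresses when it rises above baseline * (1 + tolerance).
    for field in ("p50_latency_ms", "p95_latency_ms", "p99_latency_ms"):
        old, new = getattr(baseline, field), getattr(current, field)
        if new > old * (1 + tolerance):
            problems.append(f"{field}: {old:.1f} -> {new:.1f}")
    # Throughput regresses when it falls below baseline * (1 - tolerance).
    if current.throughput_rps < baseline.throughput_rps * (1 - tolerance):
        problems.append(
            f"throughput_rps: {baseline.throughput_rps:.1f} -> "
            f"{current.throughput_rps:.1f}"
        )
    return problems
```

In a pipeline, a non-empty list could fail the job or raise an alert. The tolerance should be wider than the run-to-run noise of the benchmark environment; otherwise the check flags ordinary variance as regressions.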

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
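One possible way to capture that record alongside a run is sketched below, using only the standard library. The function name, the dictionary keys, and the example path are hypothetical; adapt them to whatever the local workspace actually uses.

```python
import json
import platform
from importlib import metadata


def environment_record(packages, data_path, notes=""):
    """Collect runtime details so a benchmark run can be reproduced later."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
        "data_path": data_path,  # hypothetical example path below
        "notes": notes,          # e.g. model size, CPU/GPU backend, memory limits
    }


if __name__ == "__main__":
    record = environment_record(["pip"], "./data/sample.jsonl",
                                notes="CPU backend")
    print(json.dumps(record, indent=2))
```

Storing this record next to each benchmark result makes it possible to tell later whether two runs are actually comparable.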