RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Hybrid Local-Cloud AI Architecture

17. Performance Benchmarking

Chapter 17 of 18 · 15 min
KEY INSIGHT

Benchmarking without comparison is measurement without meaning. Establish baselines, track trends, and react to regressions to maintain consistent performance.

Performance benchmarking quantifies gateway behavior under controlled conditions. Without benchmarks, optimization efforts lack direction and regressions go undetected. Systematic benchmarking provides the empirical foundation for capacity planning and performance engineering.

Benchmark design starts with workload characterization. Real traffic patterns inform the test scenarios: request size distributions, concurrent load patterns, and model preference ratios. Synthetic benchmarks produce reproducible results but may not reflect production behavior. Hybrid approaches combine recorded traffic replay with synthetic stress testing.
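As an illustration of workload characterization, the sketch below builds a reproducible synthetic workload from two of the inputs named above: a model preference ratio and a request size distribution. The `SyntheticRequest` type, the `build_workload` helper, the model names, and the bucket boundaries are all hypothetical; in practice the weights would come from your own traffic logs.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticRequest:
    # Hypothetical request shape, used only for this illustration.
    model: str
    prompt_tokens: int


def build_workload(n, model_weights, size_buckets, seed=0):
    """Draw n requests from weighted model and prompt-size distributions.

    model_weights: {model_name: relative_weight}
    size_buckets:  [(min_tokens, max_tokens, relative_weight), ...]
    A fixed seed makes the workload identical across runs.
    """
    rng = random.Random(seed)
    models = list(model_weights)
    m_weights = [model_weights[m] for m in models]
    b_weights = [bucket[2] for bucket in size_buckets]

    workload = []
    for _ in range(n):
        model = rng.choices(models, weights=m_weights, k=1)[0]
        low, high, _ = rng.choices(size_buckets, weights=b_weights, k=1)[0]
        workload.append(SyntheticRequest(model, rng.randint(low, high)))
    return workload


# Example mix (assumed numbers): 70% local model, mostly short prompts.
workload = build_workload(
    1000,
    model_weights={"local-small": 0.7, "cloud-large": 0.3},
    size_buckets=[(16, 128, 0.6), (129, 1024, 0.3), (1025, 4096, 0.1)],
    seed=42,
)
```

Seeding the generator keeps the synthetic half of a hybrid benchmark comparable from run to run, while recorded traffic supplies the realism.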

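import asyncio
import time
from dataclasses import dataclass
from typing import Any, Protocol

# GatewayClient, BenchmarkConfig, and WorkloadRequest are not defined in this
# chapter. The minimal shapes below are assumptions that match how the runner
# uses them; substitute your gateway's real types.
WorkloadRequest = Any


class GatewayClient(Protocol):
    async def inference(self, req: WorkloadRequest) -> Any: ...
    def set_provider(self, provider: str) -> None: ...


@dataclass
class BenchmarkConfig:
    concurrency: int = 10


@dataclass
class BenchmarkResult:
    name: str
    total_requests: int
    successful: int
    failed: int
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    throughput_rps: float


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending list; 0.0 when empty."""
    if not sorted_values:
        return 0.0
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


class BenchmarkRunner:
    def __init__(self, gateway: GatewayClient, config: BenchmarkConfig):
        self.gateway = gateway
        self.config = config
        self.results: list[BenchmarkResult] = []

    async def run_concurrent_benchmark(
        self, name: str,
        workload: list[WorkloadRequest],
        concurrency: int
    ) -> BenchmarkResult:
        latencies: list[float] = []
        errors = 0

        # Cap the number of in-flight requests at `concurrency`.
        semaphore = asyncio.Semaphore(concurrency)

        async def process_request(req: WorkloadRequest) -> None:
            nonlocal errors
            async with semaphore:
                start = time.perf_counter()
                try:
                    await self.gateway.inference(req)
                except Exception:
                    # Failed requests are counted but excluded from latency stats.
                    errors += 1
                else:
                    latencies.append((time.perf_counter() - start) * 1000)

        wall_start = time.perf_counter()
        await asyncio.gather(*(process_request(r) for r in workload))
        elapsed = time.perf_counter() - wall_start

        latencies.sort()
        result = BenchmarkResult(
            name=name,
            total_requests=len(workload),
            successful=len(latencies),
            failed=errors,
            p50_latency_ms=percentile(latencies, 50),
            p95_latency_ms=percentile(latencies, 95),
            p99_latency_ms=percentile(latencies, 99),
            # Throughput uses measured wall-clock time, not a configured duration.
            throughput_rps=len(latencies) / elapsed if elapsed > 0 else 0.0,
        )
        self.results.append(result)
        return result

    async def run_comparison(self, requests: list[WorkloadRequest],
                             providers: list[str]) -> dict[str, BenchmarkResult]:
        # Providers run one after another so they do not compete for resources.
        results: dict[str, BenchmarkResult] = {}
        for provider in providers:
            self.gateway.set_provider(provider)
            results[provider] = await self.run_concurrent_benchmark(
                f"{provider}_benchmark",
                requests,
                self.config.concurrency
            )
        return results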

Benchmark runners should execute regularly in CI/CD pipelines, nightly at minimum. Compare each run against historical baselines so that regressions trigger alerts before deployment. Long-running trends reveal performance degradation that accumulates gradually and that a single run-to-run comparison would miss.
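A baseline comparison can be as small as the sketch below, which flags a run whose p95 latency or throughput drifts past a tolerance. The `Metrics` type, the `find_regressions` helper, the 10% tolerance, and the sample numbers are assumptions for illustration, not values this chapter prescribes.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    # Hypothetical subset of a benchmark result used for gating.
    p95_latency_ms: float
    throughput_rps: float


def find_regressions(baseline, current, tolerance=0.10):
    """Return human-readable descriptions of metrics that regressed.

    A metric regresses when it is worse than the baseline by more than
    `tolerance` (a fraction, so 0.10 means 10%).
    """
    problems = []
    if current.p95_latency_ms > baseline.p95_latency_ms * (1 + tolerance):
        problems.append(
            f"p95 latency {current.p95_latency_ms:.1f} ms exceeds baseline "
            f"{baseline.p95_latency_ms:.1f} ms by more than {tolerance:.0%}"
        )
    if current.throughput_rps < baseline.throughput_rps * (1 - tolerance):
        problems.append(
            f"throughput {current.throughput_rps:.1f} rps is below baseline "
            f"{baseline.throughput_rps:.1f} rps by more than {tolerance:.0%}"
        )
    return problems


# Illustrative numbers only: latency regressed, throughput stayed in range.
baseline = Metrics(p95_latency_ms=200.0, throughput_rps=50.0)
current = Metrics(p95_latency_ms=250.0, throughput_rps=48.0)
regressions = find_regressions(baseline, current)
```

In a pipeline, a non-empty `regressions` list would typically fail the job (for example by exiting with a non-zero status) so the change is blocked before deployment.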

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Create a benchmark suite that tests your gateway with 1000 requests across varying concurrency levels (1, 10, 50, 100). Generate latency distribution plots and identify the throughput ceiling.
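One possible starting point for the exercise is sketched below. It sweeps the concurrency levels against a stand-in backend and reports where throughput stops improving. Everything here is hypothetical: `fake_inference` simulates a gateway that can serve a fixed number of requests at once, the 10% gain threshold is an arbitrary choice, and the request count is kept small so the demo finishes quickly. Replace the stub with a real gateway call and raise `n_requests` to 1000 for the actual exercise.

```python
import asyncio
import time


async def fake_inference(capacity, service_time_s):
    # Stand-in for a real gateway call: at most `capacity` slots serve at once.
    async with capacity:
        await asyncio.sleep(service_time_s)


async def measure_throughput(n_requests, concurrency, capacity, service_time_s):
    """Send n_requests with at most `concurrency` in flight; return req/s."""
    limit = asyncio.Semaphore(concurrency)

    async def one_request():
        async with limit:
            await fake_inference(capacity, service_time_s)

    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(n_requests)))
    return n_requests / (time.perf_counter() - start)


async def sweep(levels, n_requests=100, backend_slots=20, service_time_s=0.005):
    """Measure throughput at each concurrency level, one level at a time."""
    capacity = asyncio.Semaphore(backend_slots)
    results = {}
    for level in levels:
        results[level] = await measure_throughput(
            n_requests, level, capacity, service_time_s
        )
    return results


def throughput_ceiling(results, min_gain=0.10):
    """Return the highest level that still improved throughput by min_gain."""
    levels = sorted(results)
    best = levels[0]
    for prev, cur in zip(levels, levels[1:]):
        if results[cur] < results[prev] * (1 + min_gain):
            break
        best = cur
    return best


results = asyncio.run(sweep([1, 10, 50, 100]))
for level in sorted(results):
    print(f"concurrency={level:>3}  throughput={results[level]:.0f} rps")
print("throughput ceiling near concurrency", throughput_ceiling(results))
```

The per-request latencies collected by the chapter's runner can feed the distribution plots; this sketch only covers the throughput side.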
