10. Benchmarking

Chapter 10 of 18 · 15 min

KEY INSIGHT

Effective benchmarking separates reproducible research from wishful thinking. Without rigorous evaluation, claims about system performance are anecdotes, not evidence. Benchmarking an AI research system requires measuring behavior across multiple dimensions: accuracy, latency, throughput, memory footprint, and failure modes. Each dimension matters for different deployment contexts. ### Setting Up Evaluation Infrastructure A reliable benchmark harness needs three components: standardized datasets, automated measurement, and result persistence. ```python # benchmark_runner.py import json import time import psutil from dataclasses import dataclass from typing import Callable from pathlib import Path @dataclass class BenchmarkResult: name: str latency_p50_ms: float latency_p95_ms: float latency_p99_ms: float throughput_tokens_per_sec: float peak_memory_mb: float error_rate: float total_requests: int class BenchmarkRunner: def __init__(self, output_dir: Path): self.output_dir = output_dir self.results: list[BenchmarkResult] = [] def run(self, name: str, fn: Callable, test_cases: list, iterations: int = 100): latencies = [] errors = 0 process = psutil.Process() for i in range(iterations): for case in test_cases: start = time.perf_counter() try: fn(case) elapsed = (time.perf_counter() - start) * 1000 latencies.append(elapsed) except Exception: errors += 1 latencies.sort() n = len(latencies) result = BenchmarkResult( name=name, latency_p50_ms=latencies[n // 2], latency_p95_ms=latencies[int(n * 0.95)], latency_p99_ms=latencies[int(n * 0.99)], throughput_tokens_per_sec=self._calculate_throughput(test_cases, iterations), peak_memory_mb=process.memory_info().rss / 1024 / 1024, error_rate=errors / (iterations * len(test_cases)), total_requests=iterations * len(test_cases) ) self.results.append(result) self._persist(result) return result def _persist(self, result: BenchmarkResult): path = self.output_dir / f"{result.name}.json" with open(path, 'w') as f: json.dump(asdict(result), f) ``` ### Common Benchmarking Failures **Survivorship bias** occurs when evaluating only successful outputs. Track error rates explicitly—systems that fail 5% of the time often get ignored, but 5% failure in production creates user frustration. **Benchmark leakage** happens when training data overlaps with evaluation data. Always maintain strict separation. Use holdout datasets that the system has never seen. **Warm-up omission** skews latency measurements. GPUs and CPU caches need initialization time. Discard the first 10-20 requests before measuring. **Small sample sizes** produce unreliable metrics. Aim for at least 100 measurements per metric. Variance in AI system outputs requires larger samples than deterministic systems. ### Benchmark Suite Design Create tiered benchmarks: - **Unit benchmarks**: Single operations (tokenization, embedding lookup) - **Integration benchmarks**: Multi-step pipelines - **End-to-end benchmarks**: Complete user workflows Track regressions by maintaining historical baselines. A 2% regression on a benchmark you run weekly is visible; one you run once per project is invisible.

EXERCISE

Implement a benchmark suite for your research system. Run at least 100 iterations on three benchmarks: a simple retrieval task, a multi-hop reasoning task, and a generation task. Document latency distribution and error rates. Identify which benchmark shows highest variance and investigate why.