19. Benchmarking
Performance optimization without measurement is guesswork. Benchmarks give you the data to know what to fix and whether your changes help.
Measuring Agent Throughput
import time
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class BenchmarkResult:
    operation: str
    iterations: int
    total_time_seconds: float
    avg_latency_ms: float
    min_latency_ms: float
    max_latency_ms: float
    throughput_per_second: float


async def benchmark_agent_operation(
    agent,
    operation_fn: Callable[[], Awaitable[object]],
    iterations: int = 1000,
    warmup: int = 100,
) -> BenchmarkResult:
    # `agent` is not used directly; the zero-argument operation_fn is
    # expected to close over the agent it exercises.
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    # Warmup: untimed runs so caches and connections are primed
    for _ in range(warmup):
        await operation_fn()

    latencies: list[float] = []
    start = time.perf_counter()
    for _ in range(iterations):
        iter_start = time.perf_counter()
        await operation_fn()
        latencies.append((time.perf_counter() - iter_start) * 1000)  # seconds -> ms
    total_time = time.perf_counter() - start

    return BenchmarkResult(
        # functools.partial objects have no __name__, so fall back to repr()
        operation=getattr(operation_fn, "__name__", repr(operation_fn)),
        iterations=iterations,
        total_time_seconds=total_time,
        avg_latency_ms=sum(latencies) / len(latencies),
        min_latency_ms=min(latencies),
        max_latency_ms=max(latencies),
        throughput_per_second=iterations / total_time,
    )
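Average, minimum, and maximum latency can hide tail behaviour, which is often what users notice. As a minimal sketch, percentiles can be computed from the same per-iteration latency list with the standard library. The `summarize_latencies` helper and its field names are hypothetical and not part of the benchmark code above.

```python
import statistics


def summarize_latencies(latencies_ms: list[float]) -> dict:
    """Return p50/p95/p99 for per-iteration latencies in milliseconds.

    Hypothetical helper for illustration; needs at least two samples.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

One way to use it would be to return the raw `latencies` list alongside `BenchmarkResult` and pass it to this helper, so that a change in p95 or p99 is visible even when the average barely moves.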
Memory Profiling
Memory leaks are common in long-running agents, and even a small per-operation leak accumulates over thousands of calls until it can exhaust available memory:
import tracemalloc


def get_memory_usage() -> dict:
    # tracemalloc only reports memory allocated by Python while tracing is active
    if not tracemalloc.is_tracing():
        return {"error": "Tracing not started"}
    current, peak = tracemalloc.get_traced_memory()
    return {
        "current_mb": round(current / 1024 / 1024, 2),
        "peak_mb": round(peak / 1024 / 1024, 2),
    }


async def benchmark_memory(operation_fn, iterations: int = 100):
    tracemalloc.start()
    try:
        baseline = get_memory_usage()
        for _ in range(iterations):
            await operation_fn()
        after = get_memory_usage()
    finally:
        # Always stop tracing, even if operation_fn raises
        tracemalloc.stop()
    return {
        "baseline": baseline,
        "after": after,
        # Growth in traced memory; growth that persists across runs suggests a leak
        "leaked_mb": round(after["current_mb"] - baseline["current_mb"], 2),
    }
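A single growth figure says that memory grew, not where. As a hedged sketch, `tracemalloc` snapshots taken before and after the loop can be compared to list the source lines responsible for the largest change. The `top_allocation_growth` helper and the deliberately leaky `leaky_operation` are hypothetical names used only for illustration.

```python
import asyncio
import tracemalloc


async def top_allocation_growth(operation_fn, iterations: int = 100, limit: int = 5):
    """Run operation_fn repeatedly and report the lines whose allocations changed most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(iterations):
            await operation_fn()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff objects, largest absolute size change first
    stats = after.compare_to(before, "lineno")
    return [
        {
            "location": f"{s.traceback[0].filename}:{s.traceback[0].lineno}",
            "size_diff_kb": round(s.size_diff / 1024, 1),
            "count_diff": s.count_diff,
        }
        for s in stats[:limit]
    ]


# Hypothetical leaky operation: retains roughly 10 KB on every call
_leaky_cache: list[bytes] = []


async def leaky_operation():
    _leaky_cache.append(bytes(10_000))


if __name__ == "__main__":
    for row in asyncio.run(top_allocation_growth(leaky_operation, iterations=50)):
        print(row)
```

In this sketch the first row should point at the `append` line, since that is the only place memory is retained between iterations.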
Comparison Framework
Track performance across versions to detect regressions:
import sqlite3
from datetime import datetime


class BenchmarkTracker:
    def __init__(self, db_path: str):
        self.db_path = db_path  # SQLite database
        # Create the table if it is missing; column types are inferred
        # from the INSERT in save_result
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS benchmarks (
                version TEXT, operation TEXT, iterations INTEGER,
                avg_latency_ms REAL, throughput REAL, timestamp TEXT
            )
        """)
        conn.commit()
        conn.close()

    async def save_result(self, result: BenchmarkResult, version: str):
        # Store in SQLite for trend analysis.
        # Note: sqlite3 calls block the event loop; this is one short write per run.
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            INSERT INTO benchmarks (version, operation, iterations,
                                    avg_latency_ms, throughput, timestamp)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (
            version, result.operation, result.iterations,
            result.avg_latency_ms, result.throughput_per_second,
            datetime.now().isoformat()
        ))
        conn.commit()
        conn.close()

    def get_trend(self, operation: str, last_n_versions: int = 10) -> list[dict]:
        # Returns the most recent results first
        conn = sqlite3.connect(self.db_path)
        rows = conn.execute("""
            SELECT version, avg_latency_ms, throughput
            FROM benchmarks
            WHERE operation = ?
            ORDER BY timestamp DESC
            LIMIT ?
        """, (operation, last_n_versions)).fetchall()
        conn.close()
        return [{"version": r[0], "latency_ms": r[1], "throughput": r[2]} for r in rows]
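Stored trends only help if something checks them. The following is a minimal sketch of a regression check over the list that `get_trend` returns, assuming it is ordered newest first. The `detect_regression` function, its output keys, and the 10% default threshold are illustrative assumptions, not part of the tracker above.

```python
from typing import Optional


def detect_regression(trend: list[dict], threshold_pct: float = 10.0) -> Optional[dict]:
    """Compare the newest entry's latency against the previous entry.

    Expects items shaped like get_trend output: newest first, each with
    "version" and "latency_ms". Returns None when no comparison is possible.
    """
    if len(trend) < 2:
        return None  # nothing to compare against
    latest, previous = trend[0], trend[1]
    if previous["latency_ms"] <= 0:
        return None  # avoid dividing by zero on a bad baseline
    change_pct = (
        (latest["latency_ms"] - previous["latency_ms"]) / previous["latency_ms"] * 100
    )
    return {
        "version": latest["version"],
        "baseline_version": previous["version"],
        "latency_change_pct": round(change_pct, 1),
        "regressed": change_pct > threshold_pct,
    }
```

A check like this could run in CI after each benchmark run and fail the build when `regressed` is true. The right threshold depends on how noisy the measurements are on the machine that runs them.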
Knowledge transfer checkpoint
Connect benchmarking back to the local-AI decision you are learning to make. The practical question is not only whether the code or concept works, but whether it still works when the model, runtime, hardware budget, privacy requirement, and latency target are real constraints.
Before moving on, write down four things: the local runtime or deployment surface involved, the memory or throughput constraint that could change the design, the verification signal that proves the lesson worked, and the failure mode you would check first if the result looked wrong. That turns this chapter from background knowledge into an operator habit.
A good answer should be specific enough that another reader could repeat the decision on their own machine. Name the model or component when there is one, record the relevant context or token budget, and prefer a measurable check over a vague statement such as "it seems faster" or "the setup is fine."
Create benchmarks for an agent's critical path operations. Run the benchmarks before and after any optimization attempt. Verify that improvements are real and not measurement noise.
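One way to separate a real improvement from measurement noise is to compare the per-iteration latencies from the "before" and "after" runs statistically, not just their averages. The sketch below uses a bootstrap interval for the difference in medians. This is one possible technique, not a method prescribed by this chapter. The function names, the 2000 resamples, and the fixed seed are all illustrative assumptions.

```python
import random
import statistics


def median_diff_interval(
    before_ms: list[float],
    after_ms: list[float],
    resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float]:
    """Approximate 95% bootstrap interval for median(after) - median(before)."""
    if not before_ms or not after_ms:
        raise ValueError("both samples must be non-empty")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    diffs = []
    for _ in range(resamples):
        # Resample each run with replacement and record the median difference
        b = rng.choices(before_ms, k=len(before_ms))
        a = rng.choices(after_ms, k=len(after_ms))
        diffs.append(statistics.median(a) - statistics.median(b))
    diffs.sort()
    low = diffs[int(0.025 * resamples)]
    high = diffs[int(0.975 * resamples) - 1]
    return low, high


def is_real_improvement(before_ms: list[float], after_ms: list[float]) -> bool:
    # Lower latency is better: the whole interval must sit below zero
    _, high = median_diff_interval(before_ms, after_ms)
    return high < 0
```

If the interval straddles zero, the honest conclusion is that the benchmark cannot yet distinguish the change from noise. More iterations or a quieter machine may be needed before claiming a win.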