Benchmarking — Custom Agent Frameworks (Chapter 19)

Performance optimization without measurement is guesswork. Benchmarks give you the data to know what to fix and whether your changes help.

Measuring Agent Throughput

import time
import asyncio
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkResult:
    operation: str
    iterations: int
    total_time_seconds: float
    avg_latency_ms: float
    min_latency_ms: float
    max_latency_ms: float
    throughput_per_second: float

async def benchmark_agent_operation(
    agent,
    operation_fn: Callable,
    iterations: int = 1000,
    warmup: int = 100
) -> BenchmarkResult:
    # Warmup
    for _ in range(warmup):
        await operation_fn()
    
    latencies: list[float] = []
    start = time.perf_counter()
    
    for _ in range(iterations):
        iter_start = time.perf_counter()
        await operation_fn()
        latencies.append((time.perf_counter() - iter_start) * 1000)
    
    total_time = time.perf_counter() - start
    
    return BenchmarkResult(
        operation=operation_fn.__name__,
        iterations=iterations,
        total_time_seconds=total_time,
        avg_latency_ms=sum(latencies) / len(latencies),
        min_latency_ms=min(latencies),
        max_latency_ms=max(latencies),
        throughput_per_second=iterations / total_time
    )

Memory Profiling

Memory leaks in long-running agents are common and devastating:

import sys
import tracemalloc

def get_memory_usage() -> dict:
    if not tracemalloc.is_tracing():
        return {"error": "Tracing not started"}
    
    current, peak = tracemalloc.get_traced_memory()
    return {
        "current_mb": round(current / 1024 / 1024, 2),
        "peak_mb": round(peak / 1024 / 1024, 2)
    }

async def benchmark_memory(operation_fn, iterations: int = 100):
    tracemalloc.start()
    
    baseline = get_memory_usage()
    
    for _ in range(iterations):
        await operation_fn()
    
    after = get_memory_usage()
    
    tracemalloc.stop()
    
    return {
        "baseline": baseline,
        "after": after,
        "leaked_mb": round(after["current_mb"] - baseline["current_mb"], 2)
    }

Comparison Framework

Track performance across versions to detect regressions:

from datetime import datetime

class BenchmarkTracker:
    def __init__(self, db_path: str):
        self.db_path = db_path  # SQLite database
    
    async def save_result(self, result: BenchmarkResult, version: str):
        # Store in SQLite for trend analysis
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            INSERT INTO benchmarks (version, operation, iterations, 
                                   avg_latency_ms, throughput, timestamp)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (
            version, result.operation, result.iterations,
            result.avg_latency_ms, result.throughput_per_second,
            datetime.now().isoformat()
        ))
        conn.commit()
        conn.close()
    
    def get_trend(self, operation: str, last_n_versions: int = 10) -> list[dict]:
        conn = sqlite3.connect(self.db_path)
        rows = conn.execute("""
            SELECT version, avg_latency_ms, throughput 
            FROM benchmarks 
            WHERE operation = ?
            ORDER BY timestamp DESC
            LIMIT ?
        """, (operation, last_n_versions)).fetchall()
        conn.close()
        return [{"version": r[0], "latency_ms": r[1], "throughput": r[2]} for r in rows]

Knowledge transfer checkpoint

Connect Benchmarking back to the local-AI decision you are learning to make. The practical question is not only whether the code or concept works, but whether it still works when the model, runtime, hardware budget, privacy requirement, and latency target are real constraints.

Before moving on, write down four things: the local runtime or deployment surface involved, the memory or throughput constraint that could change the design, the verification signal that proves the lesson worked, and the failure mode you would check first if the result looked wrong. That turns this chapter from background knowledge into an operator habit.

A good answer should be specific enough that another reader could repeat the decision on their own machine. Name the model or component when there is one, record the relevant context or token budget, and prefer a measurable check over a vague statement such as "it seems faster" or "the setup is fine."