Running Your Own Benchmarks — Understanding AI Models (Chapter 12)

Public benchmarks carry contamination risk (test items may have leaked into a model's training data) and may not reflect your use case. Running your own benchmark measures capability on the tasks you actually care about.

Benchmark design principles:

Match your task distribution: If you analyze legal documents, benchmark on legal text, not general writing.
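Separate evaluation from development: Do not tune prompts or models against the evaluation set. Keep a held-out set that you score only for final comparisons.
Use consistent prompting: Vary prompts systematically, one factor at a time, so that a score difference can be traced to a specific change.
Measure both quality and latency: Record both for every case. A fast but inaccurate model and a slow but accurate one suit different deployments.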
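
One way to vary prompts systematically is to cross a small set of named factors, so that every variant differs from the others in a known way. The sketch below is illustrative only: the factor names, the example wording, and the helper build_prompt_variants are all hypothetical and should be replaced with whatever matters for your task.

```python
from itertools import product

# Hypothetical prompt factors; replace with the ones relevant to your task
INSTRUCTIONS = {
    "plain": "Summarize the following clause.",
    "role": "You are a contracts lawyer. Summarize the following clause.",
}
OUTPUT_FORMATS = {
    "free": "",
    "bullets": "Answer in at most three bullet points.",
}

def build_prompt_variants(text: str) -> list[dict]:
    """Return one prompt per (instruction, output format) combination."""
    variants = []
    for (inst_name, inst), (fmt_name, fmt) in product(
        INSTRUCTIONS.items(), OUTPUT_FORMATS.items()
    ):
        parts = [inst, fmt, text]
        variants.append({
            # The id records which factor levels produced this prompt
            "variant_id": f"{inst_name}/{fmt_name}",
            "prompt": "\n".join(p for p in parts if p),
        })
    return variants
```

Because each variant carries an id naming its factor levels, scores can later be grouped by instruction style or by output format rather than compared prompt by prompt.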

Basic benchmarking setup:

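# benchmark_runner.py
import math
import statistics
import time
from typing import Callable

def compute_percentile(values: list[float], pct: float) -> float:
    # Nearest-rank percentile; assumes a non-empty list
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def run_benchmark(
    model,
    test_cases: list[dict],
    scorer: Callable[[str, dict], float],
    max_tokens: int = 512,
    threshold: float = 0.5
) -> dict:
    # model is assumed to expose generate(prompt, max_tokens=...) -> str.
    # threshold is the score a case must exceed to count as a pass;
    # the 0.5 default is an arbitrary placeholder.
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    results = []

    for case in test_cases:
        # Time each generation so latency is reported alongside quality
        start = time.perf_counter()
        response = model.generate(
            case["prompt"],
            max_tokens=max_tokens
        )
        latency = time.perf_counter() - start

        score = scorer(response, case)
        results.append({
            "case_id": case["id"],
            "prompt": case["prompt"],
            "response": response,
            "expected": case.get("reference"),
            "score": score,
            "latency": latency
        })

    scores = [r["score"] for r in results]
    latencies = [r["latency"] for r in results]

    # Aggregate statistics
    return {
        "mean_score": statistics.mean(scores),
        "median_score": statistics.median(scores),
        "pass_rate": sum(1 for s in scores if s > threshold) / len(scores),
        "latency_p50": compute_percentile(latencies, 50),
        "latency_p95": compute_percentile(latencies, 95),
        "per_case": results
    }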
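
The runner reads id and prompt from every test case, plus an optional reference. A minimal sketch of loading such cases from JSON Lines (one JSON object per line) is shown here. The helper load_test_cases and the file layout are assumptions for illustration, not a required format.

```python
import json

# Keys the runner accesses unconditionally on every case
REQUIRED_KEYS = {"id", "prompt"}

def load_test_cases(lines) -> list[dict]:
    """Parse an iterable of JSON Lines strings into test-case dicts."""
    cases = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        case = json.loads(line)
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases
```

An open file handle can be passed directly, since iterating over a text file yields its lines. Failing early on a missing key or a duplicate id is cheaper than discovering the problem halfway through a long benchmark run.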

Code evaluation benchmark:

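import re

def extract_code(response: str) -> str:
    # Minimal helper (an assumption): use the first fenced block, else the whole response
    match = re.search(r"`{3}(?:\w+)?\n(.*?)`{3}", response, re.DOTALL)
    return match.group(1) if match else response

def score_code(response: str, case: dict) -> float:
    # WARNING: exec and eval run model-generated code inside this process.
    # Use this only in a sandbox (container, VM, or restricted subprocess).
    if not case["tests"]:
        return 0.0

    # Extract code from response
    code = extract_code(response)

    try:
        # Define the candidate's functions in an isolated namespace
        exec_globals = {}
        exec(code, exec_globals)
    except Exception:
        # Code that fails to parse or load (SyntaxError included) scores zero
        return 0.0

    passed = 0
    for test in case["tests"]:
        try:
            # Run against test cases
            result = eval(test["call"], exec_globals)
        except Exception:
            # A crashing test counts as a failure without voiding the others
            continue
        if result == test["expected"]:
            passed += 1

    return passed / len(case["tests"])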
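
Calling exec on model output runs untrusted code inside the benchmark process, and a generated infinite loop would hang the whole run. One possible mitigation, sketched below, is to evaluate each test call in a separate Python process with a timeout. The helper run_in_subprocess is hypothetical. A child process contains crashes and hangs but is not a security sandbox, so a container or VM is still advisable for untrusted code.

```python
import subprocess
import sys

def run_in_subprocess(code: str, call: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run `code`, then print repr() of evaluating `call`, in a child process.

    Returns (ok, output). ok is False on timeout or a non-zero exit code.
    """
    # The child defines the candidate code, then evaluates one test expression
    program = f"{code}\nprint(repr({call}))"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    # The repr is printed last, so ignore anything the candidate printed earlier
    lines = proc.stdout.strip().splitlines()
    return True, lines[-1] if lines else ""
```

A scorer built on this would compare the returned string against repr(test["expected"]) instead of comparing live objects, which restricts expected values to types with a stable repr such as numbers, strings, and lists of those.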

Synthetic benchmark generation:

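import random

# Generate synthetic test cases if you lack data
def generate_reasoning_benchmark(count: int, difficulty: str, seed: int = 0):
    # A fixed seed keeps the generated benchmark reproducible across runs
    rng = random.Random(seed)
    prompts = []

    for i in range(count):
        # Create problems with known answers
        numbers = rng.sample(range(1, 100), 3)
        # "*" avoids the ambiguity of "x"; the answer follows operator precedence
        prompt = f"What is {numbers[0]} + {numbers[1]} * {numbers[2]}?"
        answer = numbers[0] + (numbers[1] * numbers[2])

        prompts.append({
            "id": f"synthetic_{i}",
            "prompt": prompt,
            # Stored under "reference", the key run_benchmark reads
            "reference": str(answer),
            "category": "arithmetic",
            # Metadata label only; it does not change how problems are generated
            "difficulty": difficulty
        })

    return prompts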
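
Models usually wrap an arithmetic answer in prose, so comparing the whole response against the known answer string would mark most correct responses as wrong. The sketch below compares only the last integer found in the response. The name score_numeric is hypothetical, it handles integers only, and it reads the answer from the reference key or, failing that, the expected key.

```python
import re

def score_numeric(response: str, case: dict) -> float:
    """Return 1.0 if the last integer in the response equals the known answer."""
    target = case.get("reference") or case.get("expected")
    if target is None:
        return 0.0
    # Drop thousands separators so "1,234" is read as 1234
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", response)
    numbers = re.findall(r"-?\d+", cleaned)
    if not numbers:
        return 0.0
    # Assume the final number is the model's answer; earlier ones are working
    return 1.0 if int(numbers[-1]) == int(target) else 0.0
```

Taking the last number is a heuristic: it tolerates responses that show intermediate steps, but it will misjudge a response that restates the question after giving the answer. Spot-check a sample of scored cases before trusting the aggregate.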