12. Running Your Own Benchmarks

Chapter 12 of 20 · 20 min

Public benchmarks carry contamination risk (test items may have leaked into a model's training data) and may not match your use case. Running your own benchmark measures capability on the tasks you actually care about.

Benchmark design principles:

  1. Match your task distribution: If you analyze legal documents, benchmark on legal text, not general writing.
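  2. Separate evaluation from development: Keep a held-out evaluation set, and do not tune your model or prompts against its results.
  3. Use consistent prompting: Give every model the same prompts. If you vary prompts, vary them systematically, not arbitrarily.
  4. Measure both quality and latency: A fast but inaccurate model and a slow but accurate one suit different applications, so report both numbers.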
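
One way to keep evaluation separate from development (principle 2) is to assign each test case to a fixed split based on its id, so that a case never moves between splits as the benchmark grows. The sketch below is illustrative only: the function names, the "dev"/"eval" labels, and the 50% default are assumptions, not part of any particular library.

```python
import hashlib


def assign_split(case_id: str, eval_fraction: float = 0.5) -> str:
    """Deterministically assign a case to 'dev' or 'eval' from its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "eval" if bucket < eval_fraction else "dev"


def split_cases(cases: list[dict], eval_fraction: float = 0.5) -> dict:
    """Group cases (dicts with an 'id' key) into dev and eval lists."""
    splits = {"dev": [], "eval": []}
    for case in cases:
        splits[assign_split(case["id"], eval_fraction)].append(case)
    return splits
```

Iterate on prompts and settings using only the "dev" cases, and run the "eval" cases when you want a number to report.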

Basic benchmarking setup:

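# benchmark_runner.py
import math
import statistics
import time
from typing import Callable


def compute_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_benchmark(
    model,
    test_cases: list[dict],
    scorer: Callable[[str, dict], float],
    max_tokens: int = 512,
    threshold: float = 0.5,  # assumed default; a case "passes" above this score
) -> dict:
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    results = []

    for case in test_cases:
        # Time each generation call so latency can be reported
        start = time.perf_counter()
        response = model.generate(
            case["prompt"],
            max_tokens=max_tokens
        )
        latency = time.perf_counter() - start

        score = scorer(response, case)
        results.append({
            "case_id": case["id"],
            "prompt": case["prompt"],
            "response": response,
            # Accept either key name for the known answer
            "expected": case.get("reference", case.get("expected")),
            "score": score,
            "latency": latency
        })

    scores = [r["score"] for r in results]
    latencies = [r["latency"] for r in results]

    # Aggregate statistics
    return {
        "mean_score": statistics.mean(scores),
        "median_score": statistics.median(scores),
        "pass_rate": sum(1 for s in scores if s > threshold) / len(scores),
        "latency_p50": compute_percentile(latencies, 50),
        "latency_p95": compute_percentile(latencies, 95),
        "per_case": results
    }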
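
The runner above assumes two interfaces: a model object with a generate(prompt, max_tokens=...) method that returns a string, and a scorer that takes (response, case) and returns a float. The following sketch shows one minimal pair that satisfies them. StubModel and exact_match_scorer are hypothetical names made up for illustration; a real run would wrap your actual model client instead.

```python
class StubModel:
    """Hypothetical stand-in for a real model client (returns canned text)."""

    def __init__(self, canned: dict[str, str]):
        self.canned = canned

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        # max_tokens is accepted for interface compatibility but ignored here
        return self.canned.get(prompt, "")


def exact_match_scorer(response: str, case: dict) -> float:
    """1.0 if the normalized response equals case['reference'], else 0.0."""
    def normalize(text: str) -> str:
        # Lowercase and collapse runs of whitespace
        return " ".join(text.lower().split())

    return 1.0 if normalize(response) == normalize(case["reference"]) else 0.0


model = StubModel({"Capital of France?": "  Paris "})
case = {"id": "geo_1", "prompt": "Capital of France?", "reference": "paris"}
print(exact_match_scorer(model.generate(case["prompt"]), case))  # prints 1.0
```

Exact match is only suitable for short answers with one correct form. Longer or open-ended outputs need a different scorer, such as the test-based one in the next section.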

Code evaluation benchmark:

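import re

FENCE = "`" * 3  # Markdown code-fence marker


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the whole response if none."""
    match = re.search(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, response, re.DOTALL)
    return match.group(1) if match else response


def score_code(response: str, case: dict) -> float:
    # WARNING: exec/eval run model-generated code inside this process.
    # Only use this in a sandbox (container, VM, or throwaway environment).
    tests = case["tests"]
    if not tests:
        return 0.0

    # Extract code from response
    code = extract_code(response)

    try:
        # Define the model's functions in a fresh namespace
        exec_globals = {}
        exec(code, exec_globals)
    except Exception:
        # Code that fails to compile or load fails every test
        return 0.0

    passed = 0
    for test in tests:
        try:
            # Run against test cases
            if eval(test["call"], exec_globals) == test["expected"]:
                passed += 1
        except Exception:
            # A crashing test counts as a failure without voiding the others
            pass

    return passed / len(tests)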
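
Because exec and eval run generated code inside the benchmarking process, an infinite loop or a crash in the model's output can take the whole run down with it. One possible mitigation, sketched below under the assumption that a separate Python process is acceptable, is to run each test in a child interpreter with a timeout. run_isolated is a hypothetical helper name, and a child process alone is not a security sandbox: it does not restrict file or network access.

```python
import subprocess
import sys


def run_isolated(code: str, call: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run `code`, then print repr(<call>), in a child Python process.

    Returns (True, repr_of_result) on success, or (False, reason) on
    timeout or error.
    """
    program = f"{code}\nprint(repr({call}))\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    # The generated code may print on its own; the result is the last line
    return True, proc.stdout.strip().splitlines()[-1]
```

A test then passes when the call succeeds and the returned text equals repr(test["expected"]). This comparison only works for values whose repr is stable, such as numbers, strings, and lists of them.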

Synthetic benchmark generation:

# Generate synthetic test cases if you lack data
import random


def generate_reasoning_benchmark(count: int, difficulty: str, seed: int = 0) -> list[dict]:
    # Seeded generator so every model is tested on the same problems
    rng = random.Random(seed)
    prompts = []

    for i in range(count):
        # Create problems with known answers
        numbers = rng.sample(range(1, 100), 3)
        prompt = f"What is {numbers[0]} + {numbers[1]} * {numbers[2]}?"
        # Multiplication is applied before addition
        answer = numbers[0] + (numbers[1] * numbers[2])

        prompts.append({
            "id": f"synthetic_{i}",
            "prompt": prompt,
            "expected": str(answer),
            "category": "arithmetic",
            # Stored as a label only; it does not change the generated problem
            "difficulty": difficulty
        })

    return prompts
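
Models rarely answer an arithmetic prompt with a bare number, so comparing the raw response to the "expected" string would mark most correct answers wrong. A minimal sketch of a more forgiving scorer follows. It assumes the final integer in the response is the model's answer, which is a heuristic rather than a guarantee, and score_numeric_answer is a hypothetical name.

```python
import re


def score_numeric_answer(response: str, case: dict) -> float:
    """1.0 if the last integer in the response equals case['expected']."""
    # Drop thousands separators such as the comma in "9,605"
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", response)
    numbers = re.findall(r"-?\d+", cleaned)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == int(case["expected"]) else 0.0
```

This treats "4 * 5 = 20, so 3 + 20 = 23" as the answer 23. It will misread decimal answers such as "23.0" and responses that end with an unrelated number, so spot-check the per-case results before trusting the aggregate.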

EXERCISE

Create a 10-question benchmark for your primary use case (for example, code, analysis, or creative writing). Run it against 2-3 models and compare your results with the same models' scores on public benchmarks.