12. Running Your Own Benchmarks
Chapter 12 of 20 · 20 min
Public benchmarks carry contamination risk (test items may have leaked into a model's training data) and may not match your use case. Running your own benchmarks gives you capability data for the tasks you actually care about.
Benchmark design principles:
- Match your task distribution: If you analyze legal documents, benchmark on legal text, not general writing.
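- Separate evaluation from development: Do not tune your model or prompts on the cases you report scores for. Keep a held-out set, or the scores will overstate real performance.
- Use consistent prompting: Give every model the same prompt template. If you vary prompts, change one factor at a time so you can attribute any difference in score.
- Measure both quality and latency: A fast model with poor answers and a slow model with good answers involve different trade-offs, so record both.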
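One way to keep evaluation separate from development is to split your cases once, deterministically, and only look at held-out scores when reporting. The sketch below is an illustration, not a prescribed method: `split_cases` and `eval_fraction` are hypothetical names, and it assumes each case carries an `id` field as in the runner later in this chapter.

```python
import hashlib


def split_cases(
    test_cases: list[dict], eval_fraction: float = 0.5
) -> tuple[list[dict], list[dict]]:
    """Split cases into (dev, held_out) by hashing each case id.

    Hashing makes the assignment stable across runs and unaffected by
    the order of the input list.
    """
    dev, held_out = [], []
    for case in test_cases:
        digest = hashlib.sha256(str(case["id"]).encode("utf-8")).hexdigest()
        # Map the first 32 bits of the hash to a stable value in [0, 1)
        bucket = int(digest[:8], 16) / 0x100000000
        if bucket < eval_fraction:
            held_out.append(case)
        else:
            dev.append(case)
    return dev, held_out


cases = [{"id": f"case_{i}", "prompt": f"Question {i}"} for i in range(20)]
dev_set, held_out_set = split_cases(cases)
```

Iterate on prompts against `dev_set` and run `held_out_set` only when you want a number to report. Because the split depends on the id alone, adding new cases later does not move existing cases between sets.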
Basic benchmarking setup:
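# benchmark_runner.py
import math
import statistics
import time
from typing import Callable


def compute_percentile(values: list[float], pct: float) -> float:
    # Assumed helper (not defined in the original listing):
    # nearest-rank percentile over the sorted values
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_benchmark(
    model,
    test_cases: list[dict],
    scorer: Callable[[str, dict], float],
    max_tokens: int = 512,
    threshold: float = 0.5,  # assumed pass cutoff; set it to suit your scorer
) -> dict:
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    results = []
    for case in test_cases:
        # Time each generation so latency percentiles can be reported
        start = time.perf_counter()
        response = model.generate(
            case["prompt"],
            max_tokens=max_tokens
        )
        latency = time.perf_counter() - start
        score = scorer(response, case)
        results.append({
            "case_id": case["id"],
            "prompt": case["prompt"],
            "response": response,
            "expected": case.get("reference"),
            "score": score,
            "latency": latency
        })
    # Aggregate statistics
    scores = [r["score"] for r in results]
    latencies = [r["latency"] for r in results]
    return {
        "mean_score": statistics.mean(scores),
        "median_score": statistics.median(scores),
        "pass_rate": sum(1 for s in scores if s > threshold) / len(scores),
        "latency_p50": compute_percentile(latencies, 50),
        "latency_p95": compute_percentile(latencies, 95),
        "per_case": results
    }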
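The runner expects two things from you: an object with a `generate(prompt, max_tokens=...)` method and a scorer that takes `(response, case)` and returns a float. As a minimal sketch of both, the block below uses a hypothetical `CannedModel` stand-in (so it runs without any API) and an `exact_match_scorer` that compares against the case's `reference` field. Both names are illustrative assumptions, not part of any library.

```python
class CannedModel:
    """Hypothetical stand-in for a real model client; returns fixed answers."""

    def __init__(self, answers: dict[str, str]):
        self.answers = answers

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        # max_tokens is accepted for interface compatibility but ignored here
        return self.answers.get(prompt, "")


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not count
    return " ".join(text.lower().split())


def exact_match_scorer(response: str, case: dict) -> float:
    """Return 1.0 if the response equals case["reference"] after normalization."""
    return 1.0 if _normalize(response) == _normalize(case["reference"]) else 0.0


model = CannedModel({"Capital of France?": "  Paris "})
case = {"id": "geo_1", "prompt": "Capital of France?", "reference": "paris"}
score = exact_match_scorer(model.generate(case["prompt"]), case)
```

Exact match is strict: a correct answer wrapped in a full sentence scores 0.0. It suits short, closed-form answers; longer outputs usually need a more forgiving scorer.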
Code evaluation benchmark:
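def score_code(response: str, case: dict) -> float:
    # Returns the fraction of case["tests"] that the generated code passes.
    # WARNING: exec/eval run model-generated code inside this process. For
    # untrusted output, use a sandbox or a subprocess with a timeout instead.
    if not case["tests"]:
        return 0.0
    # Extract code from response (extract_code is not defined in this
    # listing; supply your own helper)
    code = extract_code(response)
    exec_globals = {}
    try:
        exec(code, exec_globals)
    except Exception:
        # Covers SyntaxError and any error raised while the code is defined
        return 0.0
    # Run against test cases
    passed = 0
    for test in case["tests"]:
        try:
            if eval(test["call"], exec_globals) == test["expected"]:
                passed += 1
        except Exception:
            # A test that raises counts as one failure, not a zero for the case
            pass
    return passed / len(case["tests"])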
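`score_code` calls `extract_code`, which this chapter does not define. A minimal sketch, assuming models wrap code in Markdown-style fences and falling back to the whole response when they do not, might look like this. The behavior (first fenced block wins) is an assumption; adjust it to the output format you actually see.

```python
import re

# Three backticks, built with multiplication to keep this listing readable
FENCE = "`" * 3


def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the stripped response."""
    # Opening fence, optional language tag up to the newline, then a lazy
    # match up to the next fence
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1) if match else response.strip()


reply = (
    "Here is the function:\n"
    f"{FENCE}python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    f"{FENCE}\n"
    "Hope that helps."
)
code = extract_code(reply)
```

If a model returns several fenced blocks, this sketch keeps only the first, so a response that splits a solution across blocks will score poorly. Logging the extracted code alongside the score makes such cases easy to spot.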
Synthetic benchmark generation:
# Generate synthetic test cases if you lack data
import random


def generate_reasoning_benchmark(count: int, difficulty: str):
    # Note: difficulty is stored as a label only; it does not change the problems
    prompts = []
    for i in range(count):
        # Create problems with known answers
        numbers = random.sample(range(1, 100), 3)
        prompt = f"What is {numbers[0]} + {numbers[1]} x {numbers[2]}?"
        # Multiplication binds tighter than addition
        answer = numbers[0] + (numbers[1] * numbers[2])
        prompts.append({
            "id": f"synthetic_{i}",
            "prompt": prompt,
            "expected": str(answer),
            "category": "arithmetic",
            "difficulty": difficulty
        })
    return prompts
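Synthetic cases store the answer as a string under `expected`, but models rarely reply with the bare number. One possible scorer, sketched here under the assumption that the final number in the response is the model's answer, pulls that number out and compares it numerically. `score_numeric` is a hypothetical name, and the heuristic will misfire if the model adds trailing numbers after its answer.

```python
import re


def score_numeric(response: str, case: dict) -> float:
    """Return 1.0 if the last number in the response equals case["expected"]."""
    # Drop thousands separators such as the comma in "9,801"
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", response)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(case["expected"]) else 0.0


case = {"id": "synthetic_0", "prompt": "What is 7 + 3 x 5?", "expected": "22"}
correct = score_numeric("3 x 5 = 15, and 7 + 15 = 22.", case)
wrong = score_numeric("The answer is 50.", case)
```

The second response shows the error these prompts are built to catch: evaluating left to right gives (7 + 3) x 5 = 50 instead of 22.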
EXERCISE
Create a 10-question benchmark for your primary use case (for example, code, analysis, or creative writing). Run it against 2-3 models and compare the results with those models' public benchmark scores.