09. HumanEval for Code
HumanEval, introduced by OpenAI alongside Codex in 2021, is one of the most widely reported benchmarks for code generation. Understanding its structure helps you judge whether a model's score says anything about your own coding needs.
Benchmark construction:
HumanEval contains 164 human-written programming problems in Python. Each problem includes:
- Function signature
- Docstring with description
- Reference implementation (for verification)
- Unit tests, hidden from the model (the docstring often shows a few example calls)
Problems range from simple list and string operations to short algorithmic tasks that require multi-step reasoning.
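To make the structure concrete, here is a made-up problem laid out like a HumanEval record. The field names (`task_id`, `prompt`, `canonical_solution`, `test`, `entry_point`) mirror the released dataset; the problem itself, the `Example/0` ID, and the `passes` helper are invented for illustration and are not part of the benchmark.

```python
# A made-up problem in the HumanEval record layout (not taken from the dataset).
example_problem = {
    "task_id": "Example/0",  # hypothetical ID
    "prompt": (
        "def running_max(numbers):\n"
        '    """Return a list where element i is the maximum of numbers[:i+1].\n'
        "    >>> running_max([1, 3, 2, 5])\n"
        "    [1, 3, 3, 5]\n"
        '    """\n'
    ),
    "canonical_solution": (
        "    result = []\n"
        "    for n in numbers:\n"
        "        result.append(n if not result else max(result[-1], n))\n"
        "    return result\n"
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]\n"
        "    assert candidate([]) == []\n"
        "    assert candidate([4, 4, 1]) == [4, 4, 4]\n"
    ),
    "entry_point": "running_max",
}


def passes(problem, completion):
    """Run prompt + completion against the problem's check() function."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace = {}
    try:
        exec(program, namespace)  # untrusted code: sandbox this in real use
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        return False


reference_ok = passes(example_problem, example_problem["canonical_solution"])
wrong_ok = passes(example_problem, "    return numbers\n")
```

The model only ever sees `prompt`. Its completion is the function body, which is appended to the prompt and then checked by the `check` function in `test`.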
Evaluation methodology:

    def evaluate_humaneval(model, humaneval_dataset):
        """Return the Pass@1 score: the fraction of problems solved in one attempt."""
        correct = 0
        for problem in humaneval_dataset:
            # Generate code from signature + docstring
            generated = model.generate(problem.prompt)
            # Extract Python code (between code fences if present)
            code = extract_python_code(generated)
            try:
                # Run the generated code against the problem's tests.
                # The code is untrusted: use a sandbox and a timeout in practice.
                if run_tests(problem, code):
                    correct += 1
            except Exception:
                pass  # Syntax errors and runtime errors count as wrong
        return correct / len(humaneval_dataset)  # Pass@1 score

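The loop above leaves `extract_python_code` and `run_tests` undefined. One possible way to fill them in is sketched below. It assumes each problem is a dict with HumanEval-style `test` and `entry_point` fields, and that the model returns the complete function rather than only its body. Both are assumptions made for this example, not a description of any official harness.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way so the snippet stays fence-safe
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_code(generated: str) -> str:
    """Return the first fenced code block, or the raw text if there is no fence."""
    match = FENCE_RE.search(generated)
    return match.group(1) if match else generated


def run_tests(problem: dict, code: str) -> bool:
    """Execute `code` plus the problem's tests in a fresh namespace.

    Raises on syntax errors, runtime errors, and failed assertions,
    so the caller is expected to wrap this in try/except.
    """
    namespace = {}
    exec(code + "\n" + problem["test"], namespace)  # untrusted code: sandbox in real use
    namespace["check"](namespace[problem["entry_point"]])
    return True


# Tiny hypothetical problem and model reply to exercise both helpers.
problem = {
    "entry_point": "add",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
}
reply = "Here is the function:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
code = extract_python_code(reply)
ok = run_tests(problem, code)
```

Running generated code with `exec` in your own process is only acceptable for a quick experiment. For anything larger, run each sample in a separate process with a time limit and no network or filesystem access.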
Pass@k metric:
Pass@k allows multiple generation attempts: you generate k samples per problem, and the problem counts as solved if at least one sample passes all of its tests:
Pass@1: Generate once, must pass (standard reporting)
Pass@10: Generate 10, any pass counts
Pass@100: Generate 100, any pass counts
A model with 70% Pass@1 might reach 90% Pass@10, which is useful to know for tasks where you can generate several solutions and check them automatically.
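In practice, Pass@k is usually not measured by drawing exactly k samples, because that estimate is noisy. A common alternative, described in the paper that introduced HumanEval, is to draw n ≥ k samples per problem, count the c samples that pass, and compute 1 − C(n−c, k) / C(n, k), averaged over problems. The sketch below implements that formula; the function name `pass_at_k` and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample. math.comb returns 0 when
    # k > n - c, in which case the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-problem (n, c) counts for three hypothetical problems, 10 samples each.
results = [(10, 10), (10, 3), (10, 0)]
for k in (1, 5, 10):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```

For k = 1 the formula reduces to c / n, the plain fraction of samples that pass, so Pass@1 estimated from many samples is simply the average per-sample success rate.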
Score interpretation:
| Pass@1 | Interpretation |
|---|---|
| <20% | Struggles with basic Python |
| 20-40% | Can write simple scripts, fails on complex logic |
| 40-60% | Handles typical interview problems |
| 60-75% | Strong, handles nontrivial algorithms |
| >75% | Very strong; at this level, also check whether training data contamination is inflating the score |
HumanEval limitations:
- Language bias: Only Python. Models may perform differently on JavaScript, Rust, Go.
- Problem style: Interview-style algorithms, not production code patterns.
- No external dependencies: Problems avoid complex library calls.
- Contamination risk: The problems and solutions are public, so they may appear in a model's training data and inflate its score.
Exercise: Test a model on 5 HumanEval problems manually. Note which ones fail and why: syntax errors, logic errors, or misunderstanding the problem.
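A small helper can do the first pass of this triage for you. The sketch below is an assumption-laden illustration, not part of any official tooling: `classify_failure` and its labels are invented here, and it expects a HumanEval-style test string that defines `check(candidate)`. It separates syntax errors and runtime errors from wrong answers; telling a logic error apart from a misread problem still requires reading the code yourself.

```python
def classify_failure(code: str, test: str, entry_point: str) -> str:
    """Label one attempt as 'pass', 'syntax error', 'runtime error', or 'wrong answer'."""
    try:
        compiled = compile(code + "\n" + test, "<candidate>", "exec")
    except SyntaxError:
        return "syntax error"
    namespace = {}
    try:
        exec(compiled, namespace)  # untrusted code: use a sandbox in real use
        namespace["check"](namespace[entry_point])
    except AssertionError:
        return "wrong answer"  # logic error or misread problem; inspect by hand
    except Exception:
        return "runtime error"
    return "pass"


# Hypothetical test and four hand-written attempts at "return the largest element".
test = "def check(candidate):\n    assert candidate([3, 1, 2]) == 3\n"
attempts = {
    "good": "def largest(xs):\n    return max(xs)\n",
    "syntax": "def largest(xs)\n    return max(xs)\n",
    "logic": "def largest(xs):\n    return min(xs)\n",
    "runtime": "def largest(xs):\n    return xs.max()\n",
}
labels = {name: classify_failure(src, test, "largest") for name, src in attempts.items()}
```

Tallying these labels across your five problems shows whether a model mostly fails to produce valid Python or produces valid code that solves the wrong problem.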