Evaluation metrics

Pass@k

Pass@k measures the probability that at least one of k independently generated samples from a model is correct. For code generation tasks, a sample is correct if it passes all of the problem's unit tests. Operators encounter Pass@k when evaluating models on coding benchmarks such as HumanEval or MBPP. Raising k raises the score, since more attempts mean more chances to succeed, but it also raises compute cost: generating 100 samples per problem costs roughly 100x the inference budget of generating one. The metric is typically reported as Pass@1 (single-sample accuracy) or as Pass@k for k=10 or k=100. It matters because a model with low Pass@1 but high Pass@100 can still be useful in workflows that rerank or filter candidates.
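
To see why Pass@1 and Pass@100 can diverge so sharply, consider a single problem where each sample is independently correct with probability p. This is a simplifying assumption, since real benchmarks average over problems of very different difficulty. Under it, the chance that at least one of k samples is correct is 1 - (1 - p)^k. A minimal sketch, with the function name and the p values chosen purely for illustration:

```python
def pass_at_k_ideal(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    given a per-sample success probability p on a single problem."""
    return 1.0 - (1.0 - p) ** k


# Illustrative per-sample success rates, not measurements of any model.
for p in (0.05, 0.30, 0.70):
    row = ", ".join(f"k={k}: {pass_at_k_ideal(p, k):.3f}" for k in (1, 10, 100))
    print(f"p={p:.2f} -> {row}")
```

Even a 5% per-sample success rate gives better than a 99% chance of at least one correct sample in 100 draws, which is why a filtering or reranking step can extract value from a model with weak single-sample accuracy.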

Deeper dive

Pass@k was popularized by OpenAI's Codex paper (2021) to evaluate code generation. Generating exactly k samples per problem and checking whether any of them passes gives a high-variance estimate, and plugging the empirical pass rate c/n into 1 - (1 - c/n)^k is biased. The standard unbiased estimator instead generates n >= k samples per problem, counts the number c that are correct, and computes Pass@k = 1 - C(n-c, k) / C(n, k), where C(a, b) is the binomial coefficient; the benchmark score is the average over problems (the Codex paper used n=200). For operators, this means that comparing Pass@k across papers requires knowing the estimator, the value of n, and the sampling settings used. In practice, running Pass@k locally is expensive: evaluating a 7B model on HumanEval (164 problems) with k=100 requires at least 16,400 generations, and 32,800 with n=200. Operators often use a smaller k (e.g., 10) or rely on published scores. The metric is also used outside code: for math reasoning (e.g., GSM8K) and open-ended generation, where 'correct' is defined by exact match or a rubric.
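
The unbiased estimator is short enough to write out directly. The sketch below uses only the Python standard library; the function names and the (n, c) input format are assumptions made for this example, not the interface of any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples generated for the problem
    c: how many of those samples passed all tests
    k: the k in Pass@k (must not exceed n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the
    # estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 200 samples each.
example = [(200, 140), (200, 3), (200, 0)]
for k in (1, 10, 100):
    print(f"Pass@{k}: {benchmark_pass_at_k(example, k):.3f}")
```

For k=1 the estimator reduces to c/n, the plain fraction of correct samples, which is a quick sanity check when wiring this into a script.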

Practical example

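On HumanEval, Llama 3.1 8B scores ~70% Pass@1 and ~90% Pass@100. To reproduce this locally, you'd generate at least 100 completions per problem. With a 16 GB GPU, generating 100 samples for a single problem at 4K context takes ~30 seconds when the samples are decoded as a batch; at a single-stream speed of ~30 tok/s, 100 completions of ~256 tokens each generated one after another would take about 14 minutes per problem. At ~30 seconds per problem, all 164 problems take ~1.4 hours of continuous generation. Most operators instead run a subset (e.g., 10 problems) or use a smaller k such as 10, which cuts the full-benchmark time to ~8 minutes.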
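
The budget arithmetic above is easy to rerun for other settings. The helper below is a rough sketch that assumes generation time scales linearly with the number of samples, which ignores batching effects and prompt processing; the function names are made up for this example, and the 30-second figure is the one quoted in the text.

```python
HUMANEVAL_PROBLEMS = 164  # benchmark size quoted in the text


def total_generations(num_problems: int, samples_per_problem: int) -> int:
    """Number of completions the model has to produce."""
    return num_problems * samples_per_problem


def wall_clock_minutes(
    num_problems: int,
    samples_per_problem: int,
    seconds_per_100_samples: float = 30.0,  # figure quoted in the text
) -> float:
    """Estimated generation time, assuming linear scaling in sample count."""
    seconds = num_problems * seconds_per_100_samples * samples_per_problem / 100
    return seconds / 60


for label, problems, samples in [
    ("full benchmark, 100 samples", HUMANEVAL_PROBLEMS, 100),
    ("full benchmark, 10 samples", HUMANEVAL_PROBLEMS, 10),
    ("10-problem subset, 100 samples", 10, 100),
]:
    gens = total_generations(problems, samples)
    mins = wall_clock_minutes(problems, samples)
    print(f"{label}: {gens} generations, about {mins:.1f} minutes")
```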

Workflow example

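In llama.cpp, each invocation of the command-line tool produces one completion, so you can approximate Pass@k by running ./main -m model.gguf -p "<prompt>" -n 256 -t 8 -c 4096 --temp 0.8 -s <seed> 100 times with a different seed each time (newer builds name the binary llama-cli instead of main). Then check each completion for correctness, for example by running the problem's unit tests against it. For automated evaluation, tools like bigcode-evaluation-harness run the full benchmark. In vLLM, you'd set n=k in SamplingParams (or the n field of an OpenAI-compatible request) and use a custom script to collect the k samples per prompt. Most operators rely on published scores rather than running Pass@k locally due to the compute cost.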
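
Checking correctness by hand does not scale past a few samples. The sketch below shows one way to count passing completions once they have been collected as strings: it appends the problem's assert-based tests to each candidate and runs the result in a separate Python process. The function names and the test format are assumptions for this example, and there is no sandboxing here, so model-generated code should only be run this way inside a container or throwaway VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate: str, tests: str, timeout_s: float = 10.0) -> bool:
    """Return True if candidate + tests runs and exits with status 0.

    WARNING: this executes untrusted code with no isolation.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs and infinite loops as failures.
            return False
    return result.returncode == 0


def count_correct(candidates: list[str], tests: str) -> int:
    """Number of candidates that pass; this is c in the Pass@k estimator."""
    return sum(passes_tests(candidate, tests) for candidate in candidates)


# Two hypothetical completions for the same toy problem.
samples = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
]
print(count_correct(samples, "assert add(2, 3) == 5"))
```

The resulting count, together with the number of samples generated, is all the unbiased estimator needs for each problem.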

Reviewed by Fredoline Eruo. See our editorial policy.