08. MMLU Benchmark Explained
MMLU (Massive Multitask Language Understanding) is one of the most widely cited academic benchmarks for language models. Understanding what it measures, and where it falls short, helps you interpret model comparisons accurately.
Benchmark structure:
MMLU contains 57 subjects across multiple domains:
Subjects: mathematics, history, law, medicine, ethics, computer science...
Questions per subject: at least 100 test questions (roughly 14,000 across all subjects), usually evaluated 5-shot
Format: 4-way multiple choice
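In the usual setup, the model sees 5 solved examples from the same subject, followed by the unsolved test question, and must produce the answer letter. Apart from a short header naming the subject, the prompt contains no task-specific instructions.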
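To make the 5-shot setup concrete, here is a minimal sketch of how such a prompt could be assembled. The `Question` class and both function names are hypothetical, and the header wording and `Answer:` layout follow a commonly used MMLU prompt format; treat them as assumptions, since evaluation harnesses differ in the details. The toy questions are invented for illustration and are not taken from MMLU.

```python
from dataclasses import dataclass
from typing import List, Sequence

CHOICE_LABELS = ("A", "B", "C", "D")


@dataclass
class Question:
    """Hypothetical container for one 4-way multiple-choice item."""
    text: str
    choices: Sequence[str]  # exactly four options, in A-D order
    answer: str = ""        # "A".."D"; left empty for the unsolved test question


def format_question(q: Question, include_answer: bool) -> str:
    """Render a question, its four options, and (optionally) the answer letter."""
    lines = [q.text]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, q.choices)]
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(subject: str, examples: List[Question], test_q: Question) -> str:
    """Header, then the solved examples, then the unsolved test question."""
    # Assumed header wording; other harnesses may phrase this differently.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(ex, include_answer=True) for ex in examples]
    blocks.append(format_question(test_q, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)


# Toy usage with invented questions (not real MMLU items).
solved = Question("What is 2 + 2?", ("3", "4", "5", "6"), "B")
unsolved = Question("What is 3 + 3?", ("5", "6", "7", "8"))
print(build_few_shot_prompt("elementary mathematics", [solved] * 5, unsolved))
```

Because the prompt ends with a bare `Answer:`, the model's next token can be read directly as its choice, which is what makes automatic scoring simple.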
Scoring methodology:
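# Pseudocode: build_few_shot_prompt, model.generate, and the subject/question
# attributes are placeholders for whatever your evaluation harness provides.
def evaluate_mmlu(model, subjects):
    """Return overall accuracy (0.0 to 1.0) across all test questions."""
    total_correct = 0
    total_questions = 0
    for subject in subjects:
        # Build the shared 5-shot prefix once per subject
        prompt = build_few_shot_prompt(subject, examples=5)
        for question in subject.test_questions:
            # question.text is an assumed attribute holding the formatted question
            answer = model.generate(prompt + question.text).strip()
            if answer == question.correct_answer:
                total_correct += 1
            total_questions += 1  # count every question, correct or not
    # Guard against an empty subject list
    return total_correct / total_questions if total_questions else 0.0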
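The scoring loop above can be turned into something runnable by passing the model in as a plain function. The sketch below is one possible shape, not a reference implementation: `predict`, `extract_choice`, and `score` are hypothetical names, and the answer-parsing rule (accept only a leading A-D letter) is an assumption that real harnesses handle in different ways. It also reports both the overall (micro) average and the per-subject (macro) average, which can differ when subjects have different numbers of questions.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Tuple


def extract_choice(completion: str) -> Optional[str]:
    """Return a leading A-D letter such as ' B', 'B.' or '(B)', else None."""
    match = re.match(r"^\s*\(?([ABCD])\b", completion)
    return match.group(1) if match else None


def score(
    predict: Callable[[str], str],
    items: Iterable[Tuple[str, str, str]],
) -> Tuple[float, float, Dict[str, float]]:
    """Score (subject, prompt, gold_letter) items.

    Returns (micro_accuracy, macro_accuracy, per_subject_accuracy).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subject, prompt, gold in items:
        total[subject] += 1
        if extract_choice(predict(prompt)) == gold:
            correct[subject] += 1
    if not total:
        raise ValueError("no items to score")
    per_subject = {s: correct[s] / total[s] for s in total}
    micro = sum(correct.values()) / sum(total.values())   # every question weighs the same
    macro = sum(per_subject.values()) / len(per_subject)  # every subject weighs the same
    return micro, macro, per_subject


# Toy usage: a stand-in "model" that always answers A.
toy_items = [("law", "p1", "A"), ("law", "p2", "B"), ("math", "p3", "D")]
print(score(lambda prompt: " A", toy_items))
```

When comparing published numbers, it is worth checking which of the two averages a report uses, since they are not interchangeable.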
Score interpretation:
| Score | Interpretation |
|---|---|
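| <25% | Below random chance for 4-way multiple choice; usually a prompting or answer-parsing problem |
| 25-40% | Near chance; roughly the level of non-expert human raters (about 34.5% in the original paper) |
| 40-65% | Clearly above non-expert humans, well below experts |
| 65-80% | Strong broad knowledge; still below estimated human expert accuracy (about 89.8%) |
| >80% | Strong reasoning combined with broad knowledge; approaching expert level |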
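Two small calculations help when reading these bands. Because the format is 4-way multiple choice, random guessing already yields about 25%, so a raw score can be rescaled so that 0 means chance and 1 means perfect. And because each subject has a limited number of questions, a per-subject score carries sampling noise. The sketch below uses the standard binomial approximation; the function names are illustrative, and the 95% interval assumes questions are independent, which is only roughly true.

```python
import math

CHANCE = 0.25  # expected accuracy of random guessing on 4 options


def chance_adjusted(accuracy: float, chance: float = CHANCE) -> float:
    """Rescale accuracy so that chance maps to 0.0 and a perfect score to 1.0."""
    return (accuracy - chance) / (1.0 - chance)


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def confidence_interval(accuracy: float, n_questions: int, z: float = 1.96):
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    half_width = z * standard_error(accuracy, n_questions)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# 70% raw accuracy on a 100-question subject
print(chance_adjusted(0.70))           # 0.45 / 0.75, i.e. about 0.6
print(confidence_interval(0.70, 100))  # roughly plus or minus 9 points
```

The second number is the practical one: on a 100-question subject, a gap of a few points between two models is within noise, whereas the same gap on the full test set is far more meaningful.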
What MMLU measures:
- Broad factual knowledge: The model must know facts across 57 subjects
- Comprehension: Extract relevant information from the question
- Reasoning: Eliminate obviously wrong answers
- Language understanding: Handle academic text in multiple domains
What MMLU does NOT measure:
- Creative writing ability
- Code generation
- Multi-step reasoning chains
- Task completion in open-ended scenarios
- Numerical computation
- Up-to-date knowledge (the question set is static and does not change after release)
Common benchmark manipulation:
Some models overfit to MMLU through:
- Training data that includes MMLU questions
- Heavy math/code training that coincidentally helps multiple-choice reasoning
- Unusual prompting strategies optimized for the specific format
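The first item, test questions leaking into training data, is the one most amenable to a mechanical check. A common heuristic is to look for long word n-grams shared between a benchmark question and the training corpus. The sketch below shows the idea only: the function names are hypothetical, the 8-word window is an arbitrary assumption, and production decontamination pipelines differ in tokenization, thresholds, and scale.

```python
import re
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Lowercase, keep alphanumeric tokens, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(documents: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram that appears anywhere in the training documents."""
    index: Set[NGram] = set()
    for doc in documents:
        index |= ngrams(doc, n)
    return index


def overlap_ratio(question: str, corpus_index: Set[NGram], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the corpus."""
    question_ngrams = ngrams(question, n)
    if not question_ngrams:  # question shorter than n tokens
        return 0.0
    return len(question_ngrams & corpus_index) / len(question_ngrams)


# Toy usage with invented text and a short window so the example is readable.
docs = ["The quick brown fox jumps over the lazy dog near the quiet river bank."]
index = build_corpus_index(docs, n=4)
print(overlap_ratio("quick brown fox jumps over the lazy dog", index, n=4))
print(overlap_ratio("completely different words appear in this sentence", index, n=4))
```

A ratio near 1.0 suggests the question text appears nearly verbatim in the corpus; a ratio of 0.0 does not prove the question is clean, since paraphrased or translated copies slip past exact matching.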
This is why reading MMLU alongside HumanEval (a code-generation benchmark) and live, regularly refreshed benchmarks gives a better signal than MMLU alone.
Look up MMLU scores for Llama 3.1, Mistral, and Phi-3. Note the spread and check if the ordering matches your expectations for model quality on real tasks.
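Once you have collected the scores, one way to quantify whether two orderings agree is Spearman rank correlation: 1.0 means identical rankings, -1.0 means exactly reversed. The sketch below implements the textbook formula for the no-ties case. The model names and numbers are made-up placeholders, not real results; substitute the figures you look up.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 = highest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Placeholder scores for three hypothetical models (NOT real benchmark results).
models = ["model_a", "model_b", "model_c"]
mmlu_scores = [0.70, 0.65, 0.60]
own_task_scores = [0.50, 0.80, 0.40]
print(spearman(mmlu_scores, own_task_scores))
```

With only three models the coefficient is very coarse, so treat it as a prompt for a closer look rather than a conclusion.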