MMLU Benchmark Explained — Understanding AI Models (Chapter 8)

MMLU (Massive Multitask Language Understanding) is one of the most widely cited academic benchmarks for language models. Understanding what it measures, and where it falls short, helps you interpret model comparisons accurately.

Benchmark structure:

MMLU contains 57 subjects across multiple domains:

Subjects: mathematics, history, law, medicine, ethics, computer science...
Questions per subject: at least 100 in the test set, about 14,000 overall (5-shot evaluation)
Format: 4-way multiple choice

The model receives 5 solved examples from the same subject, then answers each test question with no instructions beyond a one-line header naming the subject.
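To make the few-shot setup concrete, here is a minimal sketch of how such a prompt might be assembled. The header wording follows the layout commonly used for MMLU evaluation; the function names, the toy questions, and the use of two examples instead of five are assumptions made for brevity, not part of the benchmark.

```python
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_question(question, choices, answer=None):
    """Render one question; include the answer letter only for solved examples."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)

def build_prompt(subject, examples, test_question, test_choices):
    """Header, then solved examples, then the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    shots = [format_question(q, c, a) for q, c, a in examples]
    return "\n\n".join([header, *shots, format_question(test_question, test_choices)])

# Toy questions invented for illustration (not taken from MMLU)
examples = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("What is 3 * 3?", ["6", "8", "9", "12"], "C"),
]
prompt = build_prompt("elementary mathematics", examples,
                      "What is 10 / 2?", ["2", "4", "5", "10"])
print(prompt)
```

The prompt ends with a bare "Answer:" so that the model's next token is expected to be one of the four letters.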

Scoring methodology:

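def evaluate_mmlu(model, subjects):
    """Return overall accuracy (0.0 to 1.0), pooled across all subjects."""
    total_correct = 0
    total_questions = 0

    for subject in subjects:
        # Build the 5-shot prefix once per subject
        prompt = build_few_shot_prompt(subject, examples=5)

        for question in subject.test_questions:
            # Assumes question.text holds the formatted question and choices
            output = model.generate(prompt + question.text)
            # Normalize so " b" or "B\n" still matches the letter "B"
            answer = output.strip().upper()[:1]
            if answer == question.correct_answer:
                total_correct += 1
            total_questions += 1

    # Avoid dividing by zero when no questions were evaluated
    return total_correct / total_questions if total_questions else 0.0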
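The function above pools every question into a single ratio (a micro-average), so subjects with more questions weigh more. Some reports instead average the per-subject accuracies (a macro-average), and the two can differ noticeably. The sketch below shows both under made-up counts; the function name and the subject names are hypothetical.

```python
def micro_and_macro_accuracy(results):
    """results maps subject name -> (correct, total)."""
    total_correct = sum(c for c, _ in results.values())
    total_questions = sum(t for _, t in results.values())
    micro = total_correct / total_questions                   # every question counts equally
    macro = sum(c / t for c, t in results.values()) / len(results)  # every subject counts equally
    return micro, macro

# Made-up counts: a small subject with high accuracy, a large one with lower accuracy
results = {"subject_a": (90, 100), "subject_b": (300, 500)}
micro, macro = micro_and_macro_accuracy(results)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.650 macro=0.750
```

When comparing published scores, it is worth checking which of the two averages was reported.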

Score interpretation:

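Score	Interpretation
25%	Random guessing on 4-way multiple choice
25-40%	Little better than chance; non-expert humans scored about 34.5% in the original paper
40-65%	Substantial knowledge, but well short of expert level
65-80%	Strong broad knowledge across many domains
>80%	Approaching the roughly 89.8% estimated for human experts in the original paper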
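Score bands like those above are only meaningful if the measurement is precise enough to place a model in one of them. As a rough sketch, the normal approximation to the binomial gives an interval around an observed accuracy. The 14,000 figure below is an illustrative round number and the function name is hypothetical.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Approximate 95% interval for accuracy (normal approximation)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# One subject with 100 questions at 70% accuracy
low, high = accuracy_interval(70, 100)
print(f"100 questions:    {low:.3f} to {high:.3f}")

# A pooled set of 14,000 questions at the same accuracy
low_all, high_all = accuracy_interval(9800, 14000)
print(f"14,000 questions: {low_all:.3f} to {high_all:.3f}")
```

With 100 questions the interval spans roughly 9 points either side of the score, so per-subject differences of a few points are often noise. The pooled score is much tighter, at under one point either side.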

What MMLU measures:

Broad factual knowledge: The model must know facts across 57 subjects
Comprehension: Extract relevant information from the question
Reasoning: Eliminate obviously wrong answers
Language understanding: Handle academic text in multiple domains
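Because the format is multiple choice, elimination alone can move a score substantially. The sketch below uses a deliberately simplified model, which is an assumption for illustration: the model knows some fraction of answers outright, and on the rest it rules out a fixed number of wrong options and guesses uniformly among what remains.

```python
def expected_accuracy(known_fraction, eliminated_when_unsure, num_choices=4):
    """Expected accuracy under the simplified know-or-guess model described above."""
    guess = 1 / (num_choices - eliminated_when_unsure)
    return known_fraction + (1 - known_fraction) * guess

pure_guessing = expected_accuracy(0.0, 0)     # no knowledge, no elimination
knows_some = expected_accuracy(0.4, 0)        # knows 40%, guesses blindly on the rest
knows_and_prunes = expected_accuracy(0.4, 2)  # knows 40%, rules out two options otherwise

print(round(pure_guessing, 2), round(knows_some, 2), round(knows_and_prunes, 2))
# 0.25 0.55 0.7
```

Under these assumptions, a model that truly knows 40% of the answers can score anywhere from 55% to 70% depending on how well it prunes distractors, which is one reason raw accuracy overstates the fraction of questions a model actually knows.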

What MMLU does NOT measure:

Creative writing ability
Code generation
Multi-step reasoning chains
Task completion in open-ended scenarios
Numerical computation
Real-time knowledge (MMLU is a static dataset; its questions were fixed when the benchmark was released in 2020)

Common benchmark manipulation:

MMLU scores can be inflated, intentionally or not, through:

Training data that includes MMLU test questions (data contamination)
Heavy math/code training that coincidentally helps multiple-choice reasoning
Unusual prompting strategies optimized for the specific format
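The first item, test questions leaking into training data, can be screened for with a simple n-gram overlap check. This is a minimal sketch of one common heuristic, not a definitive contamination detector. The 8-word window, the function names, and the sample texts are assumptions, and the question is invented for illustration.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(question, document, n=8):
    """Fraction of the question's n-grams that also appear in the document."""
    question_ngrams = word_ngrams(question, n)
    if not question_ngrams:
        return 0.0
    return len(question_ngrams & word_ngrams(document, n)) / len(question_ngrams)

# Invented question, plus one document that copies it verbatim and one that does not
question = ("Which of the following organelles is responsible for producing "
            "most of the ATP in a eukaryotic cell")
leaked = "quiz dump: " + question + " A. nucleus B. mitochondrion"
clean = "Mitochondria generate ATP through oxidative phosphorylation in eukaryotic cells."

leaked_score = overlap_fraction(question, leaked)
clean_score = overlap_fraction(question, clean)
print(leaked_score, clean_score)  # 1.0 0.0
```

A high overlap suggests verbatim leakage, but a low one does not prove a question is unseen, because paraphrased or translated copies escape exact n-gram matching.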

This is why reading MMLU alongside a coding benchmark such as HumanEval and regularly refreshed live benchmarks gives a more reliable signal than MMLU alone.