08. MMLU Benchmark Explained

Chapter 8 of 20 · 15 min

MMLU (Massive Multitask Language Understanding) is one of the most widely cited academic benchmarks for language models. Understanding what it measures, and where it falls short, helps you interpret model comparisons accurately.

Benchmark structure:

MMLU contains 57 subjects across multiple domains:

Subjects: mathematics, history, law, medicine, ethics, computer science...
Questions per subject: at least 100 test questions (about 14,000 in total), plus 5 development questions used as few-shot examples
Format: 4-way multiple choice

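For each test question, the model sees 5 worked examples from the same subject, followed by the question itself, and must pick one of the four options. In the original evaluation setup the prompt opens with a one-line header naming the subject and carries no other task-specific instructions.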
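To make the prompt layout concrete, here is a minimal sketch of how a 5-shot prompt could be assembled. The `Question` class and both function names are hypothetical, and the header wording is modeled on the original MMLU evaluation script; other evaluation harnesses may format prompts differently.

```python
from dataclasses import dataclass

CHOICE_LABELS = ("A", "B", "C", "D")


@dataclass
class Question:
    """Hypothetical container for one MMLU-style item."""
    text: str
    choices: tuple  # four answer strings, in A-D order
    answer: str     # correct label, one of "A".."D"


def format_question(q, include_answer):
    """Render a question; worked examples include the answer, the test item does not."""
    lines = [q.text]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, q.choices)]
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(subject, examples, test_question):
    """Header + solved examples + the unsolved test question."""
    # Assumed header wording; adjust to match the harness you are reproducing.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(q, include_answer=True) for q in examples]
    blocks.append(format_question(test_question, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)
```

The prompt ends with a bare "Answer:" so that the model's next token is expected to be the option letter.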

Scoring methodology:

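def evaluate_mmlu(model, subjects):
    """Return pooled 5-shot accuracy (0.0 to 1.0) over all subjects.

    Sketch only: build_few_shot_prompt(), format_question(), and
    model.generate() are assumed helpers, not a specific library's API.
    """
    total_correct = 0
    total_questions = 0

    for subject in subjects:
        # 5-shot prefix built from the subject's example questions
        prompt = build_few_shot_prompt(subject, examples=5)

        for question in subject.test_questions:
            output = model.generate(prompt + format_question(question))
            # Compare only the leading answer letter, ignoring case and whitespace
            predicted = output.strip()[:1].upper()
            if predicted == question.correct_answer:
                total_correct += 1
            total_questions += 1

    # Guard against an empty subject list
    return total_correct / total_questions if total_questions else 0.0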
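Two details affect the final number: how the answer letter is pulled out of the model's output, and how subjects are averaged. The function above pools all questions (a micro-average), so larger subjects carry more weight; averaging per-subject accuracies (a macro-average) can give a different result. The sketch below illustrates both points. The helper names and the counts in `results` are made up for illustration, and the letter extraction is a simple heuristic.

```python
import re


def extract_choice(generation):
    """Return the first standalone capital A-D in the output, or None.

    Heuristic only: it can misfire on text such as "A good answer is C".
    """
    match = re.search(r"\b([ABCD])\b", generation)
    return match.group(1) if match else None


def micro_and_macro(per_subject):
    """per_subject maps subject name -> (correct, total)."""
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(n for _, n in per_subject.values())
    micro = total_correct / total_questions                    # pooled over questions
    macro = sum(c / n for c, n in per_subject.values()) / len(per_subject)  # mean of subjects
    return micro, macro


# Illustrative counts, not real results
results = {"subject_a": (90, 100), "subject_b": (300, 500)}
micro, macro = micro_and_macro(results)
# micro = 390 / 600 = 0.65, macro = (0.90 + 0.60) / 2 = 0.75
```

When comparing published scores, it is worth checking which averaging scheme and which answer-extraction rule each report used.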

Score interpretation:

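Score     Interpretation
~25%      Random guessing on 4-way multiple choice
25-35%    Little better than guessing; near the non-expert human baseline (34.5% in the MMLU paper)
35-60%    Clearly above non-expert humans, still far from expert level
60-80%    Broad knowledge across most subjects
>80%      Strong reasoning combined with broad knowledge; approaching the paper's estimated expert level (about 89.8%)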
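Because the format is 4-way multiple choice, random guessing already scores about 25%, so raw accuracy overstates how much a model knows. One common way to account for this is to rescale accuracy so that chance maps to 0 and a perfect score maps to 1. The function name below is hypothetical, and this rescaling is an interpretive aid, not part of the official MMLU metric.

```python
def chance_adjusted(accuracy, num_choices=4):
    """Rescale accuracy so random guessing -> 0.0 and perfect -> 1.0."""
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)


# A raw score of 70% is 45 points above chance out of a possible 75:
adjusted = chance_adjusted(0.70)  # approximately 0.60
```

A negative value means the model scored below chance, which usually points to a formatting or answer-extraction problem more than to a lack of knowledge.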

What MMLU measures:

  1. Broad factual knowledge: The model must know facts across 57 subjects
  2. Comprehension: Extract relevant information from the question
  3. Reasoning: Eliminate obviously wrong answers
  4. Language understanding: Handle academic text in multiple domains

What MMLU does NOT measure:

  • Creative writing ability
  • Code generation
  • Multi-step reasoning chains
  • Task completion in open-ended scenarios
  • Numerical computation
  • Up-to-date knowledge (MMLU is a static dataset released in 2020)

Common benchmark manipulation:

MMLU scores can be inflated, intentionally or not, through:

  • Training data that includes MMLU questions (data contamination)
  • Heavy math/code training that happens to help multiple-choice reasoning
  • Prompting strategies tuned to the specific question format
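The first item, training data that contains benchmark questions, can be screened for with a simple word n-gram overlap check. This is a rough sketch under stated assumptions: the function names and the 8-word window are illustrative choices, and real contamination audits use more robust normalization and much larger indexes.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (empty if text is shorter than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question, corpus_doc, n=8):
    """Fraction of the question's n-grams that also appear in a training document."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question too short to judge at this window size
    return len(question_grams & ngrams(corpus_doc, n)) / len(question_grams)


question = "what is the capital of france and why is it famous today"
leaked_doc = "quiz dump: " + question + " answer b"
clean_doc = "a travel article about visiting museums and cafes in several european cities"

leaked_score = overlap_fraction(question, leaked_doc)  # question appears verbatim
clean_score = overlap_fraction(question, clean_doc)    # no shared 8-word spans
```

A high fraction suggests the question text appears nearly verbatim in the training document; a low fraction does not prove the data is clean, since paraphrased or reformatted copies would be missed.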

This is why reading MMLU alongside a coding benchmark such as HumanEval and regularly refreshed ("live") benchmarks gives a more reliable signal than MMLU alone.
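One lightweight way to compare benchmarks is to check whether they rank the same models in the same order. The sketch below uses placeholder model names and made-up scores purely for illustration; substitute published numbers before drawing any conclusions.

```python
def ranking(scores):
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder values, not real benchmark results
mmlu = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60}
humaneval = {"model_a": 0.40, "model_b": 0.55, "model_c": 0.30}

mmlu_order = ranking(mmlu)
humaneval_order = ranking(humaneval)

# Models whose rank position differs between the two benchmarks
disagreements = [
    model for model in mmlu
    if mmlu_order.index(model) != humaneval_order.index(model)
]
```

Models that appear in `disagreements` are the ones where a single benchmark would give a misleading picture of relative quality.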

EXERCISE

Look up published MMLU scores for Llama 3.1, Mistral, and Phi-3. Note the spread, and check whether the ordering matches your expectations of model quality on real tasks.