GSM8K for Math — Understanding AI Models (Chapter 10)

GSM8K (Grade School Math 8K) tests multi-step arithmetic reasoning. Understanding this benchmark helps evaluate models for mathematical tasks.

Benchmark structure:

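GSM8K contains roughly 8,500 grade-school math word problems (7,473 training and 1,319 test), each requiring 2 to 8 reasoning steps. Solutions need only elementary arithmetic (addition, subtraction, multiplication, division), not advanced math.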

Example problem:

Maria buys 3 packs of stickers. Each pack has 12 stickers.
She gives 15 stickers to her friend. How many does she have left?

Solution: 3 x 12 = 36, 36 - 15 = 21
Answer: 21
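
In the released dataset, each reference solution is free text with calculator annotations in double angle brackets and a final line that begins with `####`. The following is a minimal sketch of pulling the final number out of such a string. The sticker solution is this chapter's example rewritten in that style, not an actual dataset entry, and `reference_final_answer` is a hypothetical helper name.

```python
import re

# The chapter's sticker example, written in the GSM8K reference style
# (illustrative only; not an actual dataset entry).
reference = (
    "Maria starts with 3 * 12 = <<3*12=36>>36 stickers.\n"
    "After giving 15 away she has 36 - 15 = <<36-15=21>>21 stickers.\n"
    "#### 21"
)

def reference_final_answer(solution_text):
    """Return the number after the '####' marker as a string, or None if missing."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution_text)
    if match is None:
        return None
    # Drop thousands separators so "1,200" compares equal to "1200"
    return match.group(1).replace(",", "")

print(reference_final_answer(reference))  # prints 21
```

Returning a string keeps the helper neutral about integer versus decimal answers; the caller can convert when it compares.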

Why this benchmark matters:

These problems test:

Multi-step arithmetic (each step can fail)
Maintaining intermediate state
Verifying work (catching errors before final answer)
Common sense about quantities

Scores on knowledge-heavy benchmarks do not transfer automatically: a model that gets 90% on MMLU may get, say, 50% on GSM8K if it cannot maintain reasoning chains.
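
One way to see why chains are fragile is to treat each step as succeeding independently with the same probability. That independence is a simplifying assumption made for this sketch, not a property of any real model, and `chain_success_rate` is a hypothetical helper name.

```python
def chain_success_rate(per_step_accuracy, num_steps):
    """Probability that every step is right, assuming independent, equally hard steps."""
    return per_step_accuracy ** num_steps

# GSM8K problems need 2 to 8 steps, so compare short and long chains
for steps in (2, 4, 8):
    print(steps, round(chain_success_rate(0.95, steps), 2))
```

Under this assumption, a model that is 95% reliable per step finishes a 2-step problem about 90% of the time but an 8-step problem only about 66% of the time.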

Evaluation methodology:

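def evaluate_gsm8k(model, gsm8k_dataset):
    """Return the fraction of problems whose extracted answer matches the reference."""
    correct = 0

    for problem in gsm8k_dataset:
        # Generate solution (prompt asks for step-by-step work and a fixed answer format)
        solution = model.generate(
            f"Problem: {problem.question}\n"
            "Solve step-by-step. State your final answer as: "
            "Answer: <number>"
        )

        # Extract final answer (extract_answer is a helper that parses "Answer: <number>")
        extracted = extract_answer(solution)

        # Assumes problem.answer holds only the final number, in the same form
        # that extract_answer returns
        if extracted == problem.answer:
            correct += 1

    return correct / len(gsm8k_dataset)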
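
The loop above relies on an `extract_answer` helper that the chapter does not define. One possible strict version, which only accepts the `Answer: <number>` format requested in the prompt, might look like the sketch below. The regular expression and the choice to return a string are assumptions made for illustration.

```python
import re

# Matches the "Answer: <number>" line requested in the prompt;
# allows an optional dollar sign, thousands separators, and decimals.
ANSWER_PATTERN = re.compile(r"Answer:\s*\$?(-?\d[\d,]*(?:\.\d+)?)")

def extract_answer(solution):
    """Return the last 'Answer: <number>' value as a string, or None if absent."""
    matches = ANSWER_PATTERN.findall(solution)
    if not matches:
        return None
    # Use the last match in case the model restates its answer
    return matches[-1].replace(",", "")
```

With this version, `extract_answer("36 - 15 = 21\nAnswer: 21")` returns `"21"`, while a response that never uses the `Answer:` prefix returns `None` and is scored as wrong.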

Math-specific evaluation issues:

Models often get the reasoning right but state the answer in a form the extraction step does not recognize:

Model output: "Therefore Maria has 21 stickers remaining."
The extractor looks for the pattern "Answer: 21", but the output gives 21 inside a sentence instead.
The problem is scored as a failure despite correct reasoning.

Many evaluation setups therefore use loose matching (for example, accepting the last number in the output) or re-run the solution's calculations to verify correctness.
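
A minimal sketch of loose matching follows: prefer an explicit `Answer: <number>`, fall back to the last number in the output, and compare numerically. The function names, the fallback rule, and the tolerance are assumptions chosen for illustration, not the behavior of any particular evaluation harness.

```python
import re

# A number with optional sign, thousands separators, and decimals
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

def extract_answer_lenient(solution):
    """Prefer an explicit 'Answer: <number>'; otherwise use the last number found."""
    explicit = re.findall(rf"Answer:\s*\$?({NUMBER})", solution)
    candidates = explicit or re.findall(NUMBER, solution)
    if not candidates:
        return None
    return candidates[-1].replace(",", "")

def answers_match(predicted, reference, tol=1e-6):
    """Compare numerically so '21', '21.0' and '21.00' all count as equal."""
    if predicted is None:
        return False
    return abs(float(predicted) - float(reference)) <= tol

output = "Therefore Maria has 21 stickers remaining."
print(answers_match(extract_answer_lenient(output), "21"))  # prints True
```

The fallback trades one failure mode for another: it rescues correctly reasoned answers in free-form sentences, but it can also pick up a trailing number that is not the answer, so it is worth reporting strict and lenient scores side by side.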

Score interpretation:

GSM8K accuracy	Interpretation
<20%	Cannot maintain multi-step reasoning
20-40%	Single-step reasoning, fails on chains
40-60%	2-3 step chains, arithmetic errors
60-80%	Strong reasoning, occasional mistakes
>80%	Very strong, some prompting sensitivity
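
The bands in the table above share their endpoints (40% appears in two rows), so any code that applies them has to pick a side. The sketch below assigns each shared boundary to the higher band, except that exactly 80% stays in the 60-80% band because the top row is strictly greater than 80%. That boundary rule and the name `interpret_gsm8k_score` are assumptions, not part of the table.

```python
def interpret_gsm8k_score(accuracy):
    """Map an accuracy in [0, 1] to the interpretation from the table above."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if accuracy > 0.80:
        return "Very strong, some prompting sensitivity"
    if accuracy >= 0.60:
        return "Strong reasoning, occasional mistakes"
    if accuracy >= 0.40:
        return "2-3 step chains, arithmetic errors"
    if accuracy >= 0.20:
        return "Single-step reasoning, fails on chains"
    return "Cannot maintain multi-step reasoning"

print(interpret_gsm8k_score(0.55))  # prints 2-3 step chains, arithmetic errors
```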

Beyond GSM8K:

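More challenging benchmarks for math and related reasoning:

MATH: Competition math (LaTeX, complex notation)
GSM-Plus: Perturbed variations of GSM8K problems that test robustness
ARC-Challenge: Grade-school science questions rather than math; reasoning without heavy calculation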