12. GSM8K and MATH Benchmarks
GSM8K and MATH are two of the most widely used benchmarks for evaluating mathematical reasoning in language models. Understanding what they measure, and what they do not, helps prevent misreading benchmark results.
GSM8K: Grade School Mathematics
The Grade School Math 8K (GSM8K) dataset contains roughly 8,500 word problems drawn from 5th- through 8th-grade curricula. Each problem takes between two and eight steps of arithmetic or simple algebra, and a person can solve it without a calculator.
GSM8K tests basic mathematical reasoning. Problems are intentionally simple:
Sarah has 3 dogs and 2 cats. Each dog eats 2 treats per day.
Each cat eats 1 treat per day. How many treats does Sarah
need for her pets in one week?
Correct reasoning multiplies dogs × treats × days for the dog total (3 × 2 × 7 = 42), does the same for the cats (2 × 1 × 7 = 14), then sums the two to get 56. A model that outputs "56" without showing the intermediate multiplications may be recalling a memorized answer rather than reasoning, although a bare answer alone does not prove it.
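The arithmetic for the pet-treats example can be written out one step at a time. This is only a small sketch; the variable names are illustrative and are not part of GSM8K.

```python
# Worked solution for the pet-treats example; all names are illustrative.
dogs, treats_per_dog = 3, 2
cats, treats_per_cat = 2, 1
days_per_week = 7

# Step 1: weekly treats for the dogs.
dog_treats_per_week = dogs * treats_per_dog * days_per_week

# Step 2: weekly treats for the cats.
cat_treats_per_week = cats * treats_per_cat * days_per_week

# Step 3: sum the two subtotals.
total_treats = dog_treats_per_week + cat_treats_per_week

print(dog_treats_per_week, cat_treats_per_week, total_treats)
```

Each intermediate value corresponds to one reasoning step that a chain-of-thought answer would be expected to state explicitly.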
MATH: Competition Mathematics
The MATH dataset contains 12,500 problems drawn from mathematics competitions such as AMC 10, AMC 12, and AIME. These problems require multi-step reasoning, mathematical insight, and often non-obvious algebraic manipulations.
# MATH problem difficulty levels
# (the names are informal descriptions; the dataset itself labels
# problems only as "Level 1" through "Level 5")
MATH_DIFFICULTY = {
    1: "Training/Elementary",
    2: "High School Basic",
    3: "High School Intermediate",
    4: "High School Advanced",
    5: "Competition Problems",
}

# Score reporting format (illustrative values, not real results)
example_score = {
    'level': 3,
    'accuracy': 0.67,
    'subject_id': 'algebra',
    'problem_id': 'math_5001',
}
MATH's five difficulty levels expose a more granular capability profile. A model scoring 90% at level 1 and 40% at level 3 has narrow capabilities, a weakness that a single aggregate accuracy figure would hide.
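To make the point about aggregate accuracy concrete, the following sketch computes both the overall and the per-level accuracy from a list of graded results. The function name `accuracy_by_level`, the `(level, is_correct)` input shape, and the sample data are assumptions made for illustration; they are not a format defined by the MATH dataset, and the sketch assumes a non-empty result list.

```python
from collections import defaultdict

def accuracy_by_level(results):
    """Return (overall_accuracy, {level: accuracy}) for (level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)
    per_level = {lvl: correct[lvl] / totals[lvl] for lvl in sorted(totals)}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_level

# Made-up results: 10 problems at level 1 (9 correct), 10 at level 3 (4 correct).
sample_results = (
    [(1, True)] * 9 + [(1, False)] * 1 +
    [(3, True)] * 4 + [(3, False)] * 6
)

overall, per_level = accuracy_by_level(sample_results)
print(f"overall: {overall:.2f}")
for level, acc in per_level.items():
    print(f"level {level}: {acc:.2f}")
```

With this made-up data the overall figure sits between the two per-level figures and, reported alone, would give no hint of the drop at level 3.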
Why These Benchmarks Matter (And Why They Don't)
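GSM8K and MATH became standards because they exposed reasoning capabilities that earlier benchmarks missed. GPT-3 was reported at only about 5% on GSM8K; GPT-4 was reported at 92%. Because reported scores depend on the prompting setup, this 87-point jump is not a controlled comparison, but it suggests that scale combined with chain-of-thought prompting unlocked substantial mathematical capability.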
The benchmarks matter for tracking progress. They are less reliable as absolute measures of capability because:
- Contamination: Training data may include benchmark problems, so reported numbers are best read as upper bounds.
- Format sensitivity: Scores shift with the prompt format used to elicit reasoning.
- Coverage gaps: Non-math reasoning (spatial, causal, temporal) goes unmeasured.
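One rough way to probe the contamination concern is to measure how many of a problem's word n-grams appear verbatim in a sample of training text. The sketch below is a simplified illustration, not a method the benchmarks prescribe: the function names, the whitespace tokenization, and the default window of 8 words are all assumptions, and real contamination audits are considerably more involved.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(problem, corpus_text, n=8):
    """Fraction of the problem's word n-grams that also occur in corpus_text."""
    problem_grams = word_ngrams(problem, n)
    if not problem_grams:
        return 0.0  # problem is shorter than n words
    corpus_grams = word_ngrams(corpus_text, n)
    return len(problem_grams & corpus_grams) / len(problem_grams)

problem = "Sarah has 3 dogs and 2 cats. Each dog eats 2 treats per day."

# Hypothetical corpus snippets: one quotes the problem verbatim, one is unrelated.
leaked_text = "Forum post: " + problem + " Any ideas?"
clean_text = "The weather was sunny all week and nobody did any math homework at all."

leaked_score = ngram_overlap(problem, leaked_text)
clean_score = ngram_overlap(problem, clean_text)
print(leaked_score, clean_score)
```

A high overlap only flags verbatim or near-verbatim copies; paraphrased or translated leaks would pass this check unnoticed.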
Interpreting Benchmark Results
When evaluating model claims based on these benchmarks:
- Ask what evaluation setup was used (chain-of-thought prompting, tool integration, ensembling)
- Check whether results are averaged across difficulty levels
- Look for tests of generalization, not just results on held-out splits
# Common benchmark reporting pattern (check what you're NOT seeing)
# "90% on MATH" often means:
# - Average across all levels (per-level differences are hidden)
# - With chain-of-thought prompting enabled
# - On the test split (test problems may have leaked into training data)
# - Without strict format requirements
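The effect of format requirements on a reported score can be illustrated by grading the same outputs two ways. In the sketch below, strict grading only accepts a final `#### <number>` line, the convention used by GSM8K reference solutions, while lenient grading takes the last number found anywhere in the output. Both function names and the sample outputs are hypothetical, and whether a given evaluation harness behaves like either one is an assumption to verify, not something the benchmarks mandate.

```python
import re

# A number with optional sign, thousands separators, and decimal part.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

def strict_answer(output):
    """Return the number on a final '#### <number>' line, or None if absent."""
    match = re.search(rf"####\s*({NUMBER})\s*$", output.strip())
    return match.group(1).replace(",", "") if match else None

def lenient_answer(output):
    """Return the last number appearing anywhere in the output, or None."""
    numbers = re.findall(NUMBER, output)
    return numbers[-1].replace(",", "") if numbers else None

# Hypothetical model outputs for the pet-treats problem (expected answer: 56).
formatted = "Dogs need 42 and cats need 14.\n#### 56"
unformatted = "The total is 56 treats per week."
trailing = "She needs 56 treats, so buy 2 bags."

for text in (formatted, unformatted, trailing):
    print(repr(strict_answer(text)), repr(lenient_answer(text)))
```

Strict grading rejects the second output even though it states the right total, while lenient grading mis-scores the third by picking up the trailing number, so the same model can receive different scores depending on the grader.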
What benchmarks miss: Real-world reasoning involves messy documentation, ambiguous requirements, and multi-modal inputs. Math benchmarks are clean by design. A 95% score on MATH says nothing about extracting information from inconsistent logs or forming valid hypotheses from partial data.
Run the same set of 20 problems through your reasoning model twice: once with chain-of-thought explicitly enabled, once without. Record the accuracy difference and note how often the reasoning structure differs between runs. A large gap suggests that the model's accuracy depends on being prompted to reason step by step rather than on reasoning it applies by default.
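A minimal harness for this exercise might look like the following sketch. The `solve(question, use_cot)` callable is a hypothetical wrapper around whatever model you are testing, and `fake_solve` is a stand-in so the example runs on its own. The sketch only compares final answers; comparing the reasoning structure between runs still has to be done by reading the outputs.

```python
def compare_runs(problems, solve):
    """Grade each problem with and without chain-of-thought and summarize the gap.

    problems: non-empty list of (question, expected_answer) pairs.
    solve:    hypothetical callable solve(question, use_cot) -> answer string.
    """
    with_cot = [solve(q, True) == expected for q, expected in problems]
    without_cot = [solve(q, False) == expected for q, expected in problems]
    n = len(problems)
    acc_with = sum(with_cot) / n
    acc_without = sum(without_cot) / n
    return {
        "accuracy_with_cot": acc_with,
        "accuracy_without_cot": acc_without,
        "gap": acc_with - acc_without,
        # Problems whose correctness changed between the two runs.
        "changed": sum(a != b for a, b in zip(with_cot, without_cot)),
    }

def fake_solve(question, use_cot):
    """Stand-in model: always right with CoT, right only on '2+2' without it."""
    answers = {"2+2": "4", "3*3": "9"}
    if use_cot or question == "2+2":
        return answers[question]
    return "?"

summary = compare_runs([("2+2", "4"), ("3*3", "9")], fake_solve)
print(summary)
```

To run the actual exercise, replace `fake_solve` with a function that calls your model and pass in your 20 problems with their reference answers.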