10. GSM8K for Math
GSM8K (Grade School Math 8K) tests multi-step arithmetic reasoning. Understanding this benchmark helps evaluate models for mathematical tasks.
Benchmark structure:
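GSM8K contains roughly 8,500 grade-school math word problems (7,473 training and 1,319 test), each requiring 2 to 8 reasoning steps. Problems use only elementary arithmetic (addition, subtraction, multiplication, division), with no advanced math.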
Example problem:
Maria buys 3 packs of stickers. Each pack has 12 stickers.
She gives 15 stickers to her friend. How many does she have left?
Solution: 3 x 12 = 36, 36 - 15 = 21
Answer: 21
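In the published dataset, each reference solution is stored as free text: intermediate calculations are annotated in double angle brackets, and the final answer follows a `####` marker on the last line. The sketch below parses that layout for the sticker problem. The `reference` string is written by hand to mimic the dataset's format, and the helper name `parse_reference` is illustrative, not part of any library.

```python
import re

# Hand-written reference solution mimicking the GSM8K answer layout
# (calculator annotations in <<...>>, final answer after "####").
reference = (
    "Maria starts with 3 * 12 = <<3*12=36>>36 stickers.\n"
    "After giving 15 away she has 36 - 15 = <<36-15=21>>21 stickers.\n"
    "#### 21"
)

def parse_reference(text):
    """Split a reference solution into its annotated steps and final answer."""
    steps = re.findall(r"<<([^>]*)>>", text)       # e.g. "3*12=36"
    final = text.rsplit("####", 1)[1].strip()      # text after the last marker
    return steps, final.replace(",", "")           # drop thousands separators

steps, final = parse_reference(reference)
print(steps)  # ['3*12=36', '36-15=21']
print(final)  # 21
```

Parsing the reference this way gives a clean target string to compare model answers against.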
Why this benchmark matters:
These problems test:
- Multi-step arithmetic (each step can fail)
- Maintaining intermediate state
- Verifying work (catching errors before final answer)
- Common sense about quantities
A model can score well on a knowledge-heavy benchmark yet poorly here: one that gets 90% on MMLU may get only 50% on GSM8K if it cannot maintain reasoning chains.
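To see why "each step can fail" matters, suppose that every step succeeds independently with the same probability. This is a simplifying assumption, not a measured property of any model. A chain of n steps then succeeds with probability p^n, which the sketch below evaluates; the function name is illustrative.

```python
def chain_success(p_step, n_steps):
    """Probability that all n_steps succeed, assuming independent steps."""
    return p_step ** n_steps

# Per-step accuracy of 95% over chains of 2, 5 and 8 steps.
for n in (2, 5, 8):
    print(n, round(chain_success(0.95, n), 3))
```

Under this assumption, 95% per-step accuracy yields roughly 0.90, 0.77 and 0.66 for 2, 5 and 8 steps, so small per-step error rates compound quickly on longer problems.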
Evaluation methodology:
def evaluate_gsm8k(model, gsm8k_dataset):
    """Return the fraction of problems whose extracted answer matches the reference."""
    correct = 0
    for problem in gsm8k_dataset:
        # Generate a solution; the prompt asks for step-by-step work
        # and a fixed final-answer format that is easy to parse.
        solution = model.generate(
            f"Problem: {problem.question}\n"
            "Solve step-by-step. State your final answer as: "
            "Answer: <number>"
        )
        # Extract the final answer and compare it with the reference
        extracted = extract_answer(solution)
        if extracted == problem.answer:
            correct += 1
    return correct / len(gsm8k_dataset)
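The loop above calls `extract_answer` without defining it. Below is a minimal sketch, assuming the model follows the requested `Answer: <number>` format. The regular expression, the float return type, and the choice to return `None` when nothing matches are all assumptions for illustration, not part of any official harness.

```python
import re

# Matches "Answer: 21", "Answer: $1,250.50", "answer: -3" and similar.
ANSWER_RE = re.compile(r"Answer:\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)", re.IGNORECASE)

def extract_answer(solution):
    """Return the number after the last 'Answer:' tag as a float, or None."""
    matches = ANSWER_RE.findall(solution)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))  # strip thousands separators

print(extract_answer("3 x 12 = 36, 36 - 15 = 21\nAnswer: 21"))       # 21.0
print(extract_answer("Therefore Maria has 21 stickers remaining."))  # None
```

Because this version returns a float, reference answers would need to be converted to the same numeric type before the equality check in the loop.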
Math-specific evaluation issues:
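Models often get the reasoning right, but the evaluation script fails to extract the answer because it is not in the expected format:
Model output: "Therefore Maria has 21 stickers remaining."
The extractor looks for "Answer: 21", but the output states 21 in a different form, so nothing is extracted.
The problem is scored as a failure despite correct reasoning.
Many evaluation harnesses therefore use looser matching (for example, taking the last number in the output) or re-run the solution to verify correctness.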
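One possible form of loose matching is sketched below: prefer an explicit `Answer:` tag, fall back to the last number in the output, and compare numerically instead of by string. The helper names, the fallback rule, and the tolerance are assumptions for illustration; real harnesses differ in their exact rules.

```python
import re

NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
TAGGED_RE = re.compile(r"Answer:\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)", re.IGNORECASE)

def extract_answer_loose(solution):
    """Prefer an explicit 'Answer:' tag; otherwise fall back to the last number."""
    candidates = TAGGED_RE.findall(solution) or NUMBER_RE.findall(solution)
    if not candidates:
        return None
    return float(candidates[-1].replace(",", ""))

def is_correct(solution, reference, tol=1e-6):
    """Compare numerically so that '21', '21.0' and '21.00' all match."""
    predicted = extract_answer_loose(solution)
    return predicted is not None and abs(predicted - float(reference)) <= tol

print(is_correct("Therefore Maria has 21 stickers remaining.", "21"))  # True
```

The last-number fallback trades false negatives for possible false positives: if the model keeps writing after its answer, the final number in the text may not be the intended one.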
Score interpretation:
| GSM8K accuracy | Interpretation |
|---|---|
| <20% | Cannot maintain multi-step reasoning |
| 20-40% | Single-step reasoning, fails on chains |
| 40-60% | 2-3 step chains, arithmetic errors |
| 60-80% | Strong reasoning, occasional mistakes |
| >80% | Very strong, some prompting sensitivity |
Beyond GSM8K:
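More challenging math benchmarks, plus one related reasoning benchmark:
- MATH: Competition math (LaTeX, complex notation)
- GSM-Plus: Harder variations of GSM8K
- ARC-Challenge: Grade-school science questions (multiple choice); not a math benchmark, but it tests reasoning without heavy calculation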
Find 5 GSM8K problems the model gets wrong. Analyze whether failures are in arithmetic, reasoning, or answer extraction.
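One way to start this analysis is to check each stated calculation mechanically. The sketch below scans a solution for simple `a op b = c` statements and reports those whose result is wrong. It handles only binary expressions on plain numbers (no thousands separators, no chained expressions), and the function name and pattern are assumptions for illustration.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "x": operator.mul, "*": operator.mul, "/": operator.truediv}

# Matches simple binary statements such as "3 x 12 = 36" or "36 - 15 = 21".
STEP_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-x*/])\s*(\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

def find_arithmetic_errors(solution, tol=1e-6):
    """Return the stated calculations whose right-hand side is wrong."""
    errors = []
    for a, op, b, stated in STEP_RE.findall(solution):
        if op == "/" and float(b) == 0:
            continue  # skip division by zero rather than crash
        actual = OPS[op](float(a), float(b))
        if abs(actual - float(stated)) > tol:
            errors.append(f"{a} {op} {b} = {stated} (expected {actual:g})")
    return errors

print(find_arithmetic_errors("3 x 12 = 36, 36 - 15 = 21"))  # []
print(find_arithmetic_errors("3 x 12 = 38, 38 - 15 = 23"))  # ['3 x 12 = 38 (expected 36)']
```

As a rough triage rule, a wrong final answer with no flagged calculations points toward a reasoning error, while a correct number that appears in the text but was not extracted points toward an extraction failure.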