RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Understanding AI Models

10. GSM8K for Math

Chapter 10 of 20 · 20 min
KEY INSIGHT

GSM8K isolates multi-step reasoning without requiring external knowledge: weak performance indicates failures in the reasoning chain.

GSM8K (Grade School Math 8K) tests multi-step arithmetic reasoning. Understanding this benchmark helps evaluate models for mathematical tasks.

Benchmark structure:

GSM8K contains roughly 8,500 grade-school math word problems, each requiring 2-8 reasoning steps. Scores are usually reported on the 1,319-problem test split. Problems use only elementary arithmetic, with no advanced math.

Example problem:

Maria buys 3 packs of stickers. Each pack has 12 stickers.
She gives 15 stickers to her friend. How many does she have left?

Solution: 3 x 12 = 36, 36 - 15 = 21
Answer: 21
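
In the published dataset, reference solutions mark intermediate calculations with `<<...>>` annotations and put the final answer after `####` on the last line. The sketch below parses that layout. The `reference` string is the sticker example above rewritten in that style for illustration, not an actual dataset entry, and `parse_reference` is a hypothetical helper name.

```python
import re

def parse_reference(answer_text):
    """Split a GSM8K-style reference solution into (steps, final_answer)."""
    # Calculator annotations look like <<3*12=36>>: expression, then result.
    steps = re.findall(r"<<([^=<>]+)=([^<>]+)>>", answer_text)
    # The final answer follows the "####" marker.
    _, marker, tail = answer_text.rpartition("####")
    if not marker:
        raise ValueError("no '####' final-answer marker found")
    final = tail.strip().replace(",", "")
    return steps, final

# Illustrative text only: the chapter's example in the dataset's style.
reference = (
    "Maria starts with 3*12=<<3*12=36>>36 stickers.\n"
    "After giving 15 away she has 36-15=<<36-15=21>>21 stickers.\n"
    "#### 21"
)
steps, final = parse_reference(reference)
print(steps, final)
```

Keeping the intermediate steps alongside the final answer makes it easier to see later which step of a chain a model got wrong.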

Why this benchmark matters:

These problems test:

  • Multi-step arithmetic (each step can fail)
  • Maintaining intermediate state
  • Verifying work (catching errors before final answer)
  • Common sense about quantities

A model that gets 90% on MMLU may get 50% on GSM8K if it cannot maintain reasoning chains.
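
One way to see why chains are fragile is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification and not a measured property of any model. The function name `chain_success` is hypothetical.

```python
def chain_success(per_step_accuracy, num_steps):
    """Probability that all steps succeed, assuming independent steps."""
    return per_step_accuracy ** num_steps

# Compare short and long chains at a fixed 95% per-step accuracy.
for num_steps in (2, 4, 8):
    print(num_steps, round(chain_success(0.95, num_steps), 3))
```

Under this assumption, a model that is right on 95% of individual steps completes only about two thirds of 8-step problems, so small per-step error rates compound quickly.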

Evaluation methodology:

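import re

def extract_answer(solution):
    """Return the number after "Answer:" as a string, or None if absent."""
    match = re.search(r"Answer:\s*\$?(-?\d[\d,]*(?:\.\d+)?)", solution)
    return match.group(1).replace(",", "") if match else None

def evaluate_gsm8k(model, gsm8k_dataset):
    """Return the fraction of problems answered correctly.

    Assumes model.generate(prompt) returns a string and each problem has
    .question and .answer (the final number only).
    """
    correct = 0

    for problem in gsm8k_dataset:
        # Generate solution (prompt includes a step-by-step instruction)
        solution = model.generate(
            f"Problem: {problem.question}\n"
            "Solve step-by-step. State your final answer as: "
            "Answer: <number>"
        )

        # Extract final answer
        extracted = extract_answer(solution)

        # Compare as strings so "21" matches a reference of 21
        if extracted == str(problem.answer):
            correct += 1

    return correct / len(gsm8k_dataset)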

Math-specific evaluation issues:

Models often get the reasoning right, but the evaluation harness fails to extract the answer:

Model output: "Therefore Maria has 21 stickers remaining."
The extractor looks for "Answer: 21", but the number appears in a different format.
The problem is marked wrong despite correct reasoning.

Many evaluation harnesses therefore use lenient matching, such as falling back to the last number in the output, or re-run the solution to verify correctness.
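
A minimal sketch of lenient matching, assuming a two-stage rule: prefer an explicit "Answer:" tag, otherwise fall back to the last number in the output. The helper name `extract_answer_lenient` and the exact patterns are illustrative choices, not the rule used by any particular benchmark harness.

```python
import re

# Integer or decimal, optional sign, optional thousands separators.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

def extract_answer_lenient(solution):
    """Return the final numeric answer as a float, or None if no number is found."""
    # Stage 1: an explicit "Answer: <number>" tag, with an optional "$".
    tagged = re.search(rf"Answer:\s*\$?({NUMBER})", solution, re.IGNORECASE)
    if tagged:
        raw = tagged.group(1)
    else:
        # Stage 2: fall back to the last number anywhere in the text.
        numbers = re.findall(NUMBER, solution)
        if not numbers:
            return None
        raw = numbers[-1]
    return float(raw.replace(",", ""))

print(extract_answer_lenient("Therefore Maria has 21 stickers remaining."))
```

The fallback trades precision for recall: it rescues correctly reasoned answers in free-form sentences, but it can also pick up a trailing number that is not the answer, so lenient and strict scores are worth reporting separately.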

Score interpretation:

GSM8K score   Interpretation
<20%          Cannot maintain multi-step reasoning
20-40%        Single-step reasoning, fails on chains
40-60%        2-3 step chains, arithmetic errors
60-80%        Strong reasoning, occasional mistakes
>80%          Very strong, some prompting sensitivity

Beyond GSM8K:

More challenging math benchmarks:

  • MATH: Competition math (LaTeX, complex notation)
  • GSM-Plus: Harder variations of GSM8K
  • ARC-Challenge: Grade-school science questions (not a math benchmark) that test reasoning without heavy calculation
EXERCISE

Find 5 GSM8K problems that a model of your choice gets wrong. For each one, decide whether the failure lies in arithmetic, reasoning, or answer extraction.
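
A rough first-pass triage for this exercise could look like the sketch below. The heuristics are assumptions and will misclassify some cases: a failure counts as "arithmetic" if any stated equation of the form `a op b = c` is numerically wrong, as "extraction" if the correct number appears somewhere in the output but was not the extracted answer, and as "reasoning" otherwise. `classify_failure` is a hypothetical helper, so check its labels by reading the outputs.

```python
import re
from operator import add, sub, mul, truediv

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"
# Matches simple stated equations such as "3 x 12 = 36" (no thousands separators).
EQUATION = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-*/x])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)
OPS = {"+": add, "-": sub, "*": mul, "x": mul, "/": truediv}

def classify_failure(solution, extracted, reference, tol=1e-6):
    """Label one output as correct, arithmetic, extraction, or reasoning."""
    if extracted is not None and abs(extracted - reference) < tol:
        return "correct"
    # Arithmetic: some stated equation does not actually hold.
    for left, op, right, stated in EQUATION.findall(solution):
        if op == "/" and float(right) == 0:
            continue
        if abs(OPS[op](float(left), float(right)) - float(stated)) > tol:
            return "arithmetic"
    # Extraction: the right number is in the text but was not picked up.
    numbers = [float(n.replace(",", "")) for n in re.findall(NUMBER, solution)]
    if any(abs(n - reference) < tol for n in numbers):
        return "extraction"
    return "reasoning"

print(classify_failure("3 x 12 = 38, 38 - 15 = 23\nAnswer: 23", 23.0, 21.0))
```

Here `extracted` is whatever number the evaluation harness pulled from the output (or `None`), and `reference` is the correct answer.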
