RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Understanding AI Models

15. Model Selection for Reasoning

Chapter 15 of 20 · 15 min
KEY INSIGHT

Reasoning performance is not captured by knowledge benchmarks. Test with multi-step problems that require maintaining intermediate state.

Reasoning tasks (math, logic puzzles, multi-step planning) stress models differently than knowledge retrieval. This chapter covers selection criteria for reasoning-heavy workloads.

Reasoning model requirements:

  1. Chain-of-thought: Maintain coherent reasoning across many steps
  2. Error recovery: Catch and correct mistakes mid-reasoning
  3. Working memory: Track multiple intermediate conclusions
  4. Verification: Check work against constraints
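One way to probe the working-memory requirement is to check that a response passes through the expected intermediate values in the right order, not just that it lands on the final number. The sketch below is a minimal illustration under assumptions: the function name, the sample responses, and the checkpoint values are all hypothetical, and a simple integer scan will miss values written as words or fractions.

```python
import re


def intermediate_values_in_order(response: str, checkpoints: list[int]) -> bool:
    """Return True if every checkpoint appears in the response, in order."""
    # Pull all integers out of the response, in the order they appear
    numbers = [int(tok) for tok in re.findall(r"-?\d+", response)]
    position = 0
    for expected in checkpoints:
        try:
            # Search only after the previous match so order is enforced
            position = numbers.index(expected, position) + 1
        except ValueError:
            return False
    return True


# Made-up responses to a problem whose intermediate results are 4, 6, and 3
good = "Half of 8 is 4. Adding 2 gives 6 for Bob. Half of 6 is 3. Answer: 3"
bad = "Bob gets 2 apples, so Charlie gets 1. Answer: 1"

print(intermediate_values_in_order(good, [4, 6, 3]))  # True
print(intermediate_values_in_order(bad, [4, 6, 3]))   # False
```

A check like this separates a model that guessed the right answer from one that tracked the state correctly at each step.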

Benchmark comparison:

Model          MMLU   GSM8K   MATH
Phi-3-medium   82%    78%     53%
Mistral 7B     71%    52%     25%
Llama 3.1 8B   68%    55%     35%
Gemma 2 9B     74%    65%     41%

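These numbers show that MMLU alone does not reliably predict reasoning performance: Mistral 7B beats Llama 3.1 8B on MMLU (71% vs. 68%) yet trails it on both GSM8K and MATH, and the spread between models is twice as wide on MATH (28 points) as on MMLU (14 points). Phi-3-medium's strong MATH score comes from reasoning chain training, not pure knowledge.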
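To make the comparison concrete, the sketch below lists the model pairs that MMLU and a reasoning benchmark rank in opposite order, and compares how widely the scores spread on each. The scores are copied from the table above; the function names are illustrative, and four models are too few to support a statistical claim.

```python
from itertools import combinations

# Scores copied from the table above (percent)
scores = {
    "Phi-3-medium": {"MMLU": 82, "GSM8K": 78, "MATH": 53},
    "Mistral 7B":   {"MMLU": 71, "GSM8K": 52, "MATH": 25},
    "Llama 3.1 8B": {"MMLU": 68, "GSM8K": 55, "MATH": 35},
    "Gemma 2 9B":   {"MMLU": 74, "GSM8K": 65, "MATH": 41},
}


def rank_inversions(scores, knowledge, reasoning):
    """List model pairs that the two benchmarks order differently."""
    inversions = []
    for a, b in combinations(scores, 2):
        knowledge_gap = scores[a][knowledge] - scores[b][knowledge]
        reasoning_gap = scores[a][reasoning] - scores[b][reasoning]
        # Opposite signs mean the benchmarks disagree on which model is better
        if knowledge_gap * reasoning_gap < 0:
            inversions.append((a, b))
    return inversions


def spread(scores, benchmark):
    """Gap between the best and worst score on one benchmark."""
    values = [row[benchmark] for row in scores.values()]
    return max(values) - min(values)


print(rank_inversions(scores, "MMLU", "MATH"))         # [('Mistral 7B', 'Llama 3.1 8B')]
print(spread(scores, "MMLU"), spread(scores, "MATH"))  # 14 28
```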

Architecture effects on reasoning:

  • Long context: Leaves room for long reasoning chains without truncation (a 128K window helps)
  • Attention patterns: Some models attend better to relevant context
  • Training data composition: Math/code heavy training improves reasoning
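The long-context point comes down to simple token arithmetic: the prompt, every reasoning step, and the final answer all have to fit in the window. The sketch below is a rough budget check; the function name and every token count are assumptions chosen for illustration, and real step lengths vary widely by model and task.

```python
def fits_in_context(prompt_tokens, steps, tokens_per_step, context_window, answer_tokens=50):
    """Check whether a prompt plus a step-by-step reasoning chain fits the window.

    Returns (fits, tokens_needed).
    """
    needed = prompt_tokens + steps * tokens_per_step + answer_tokens
    return needed <= context_window, needed


# Hypothetical workload: 2,000-token prompt, 60 steps at about 150 tokens each
print(fits_in_context(2_000, 60, 150, context_window=8_192))    # (False, 11050)
print(fits_in_context(2_000, 60, 150, context_window=131_072))  # (True, 11050)
```

Fitting in the window is necessary but not sufficient: a model can still lose track of early steps that are technically in context.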

Testing reasoning directly:

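import re

# 8 / 2 + 2 = 6 apples go to Bob; 6 / 2 = 3 apples go to Charlie
APPLES_PROBLEM = (
    "Alice has 8 apples. She gives Bob 2 more than half her apples. "
    "Bob then gives Charlie half of what he received. "
    "How many apples does Charlie have?"
)


def extract_final_number(response):
    # Return the integer after the last "Answer:" marker, or None if absent
    matches = re.findall(r"Answer:\s*(-?\d+)", response)
    return int(matches[-1]) if matches else None


def test_reasoning(model, problem=APPLES_PROBLEM, expected=3):
    # Multi-step problem with a verifiable answer
    prompt = (
        f"Problem: {problem}\n\n"
        'Think step by step. Show your work. End with "Answer: X"'
    )

    # model.generate stands in for whatever call your inference client exposes
    response = model.generate(prompt)

    # Crude check for visible reasoning steps
    lowered = response.lower()
    has_steps = "step" in lowered or "first" in lowered

    # Extract the final answer
    answer = extract_final_number(response)

    return {
        "has_reasoning": has_steps,
        "answer": answer,
        "correct": answer == expected,
    }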
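A single problem says little about a model, so the same idea can be extended to a small problem set with an accuracy score. The harness below is a minimal, self-contained sketch: `run_benchmark`, `extract_answer`, and `fake_generate` are hypothetical names, the two problems are invented for illustration, and `generate` is assumed to be any callable that maps a prompt string to a response string.

```python
import re


def extract_answer(response):
    """Return the integer after the last 'Answer:' marker, or None."""
    matches = re.findall(r"Answer:\s*(-?\d+)", response)
    return int(matches[-1]) if matches else None


def run_benchmark(generate, problems):
    """Score a model on (question, expected_answer) pairs."""
    results = []
    for question, expected in problems:
        prompt = f'Problem: {question}\n\nThink step by step. End with "Answer: X"'
        answer = extract_answer(generate(prompt))
        results.append({"question": question, "answer": answer, "correct": answer == expected})
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results


# Stand-in "model" with canned replies, used only to exercise the harness
def fake_generate(prompt):
    if "12 marbles" in prompt:
        return "First, 12 / 3 = 4. Then 4 + 5 = 9. Answer: 9"
    return "I am not sure."


problems = [
    ("Dana has 12 marbles, keeps a third, then finds 5 more. How many does she have?", 9),
    ("A train leaves with 30 passengers; 10 get off and 4 get on. How many remain?", 24),
]

accuracy, results = run_benchmark(fake_generate, problems)
print(accuracy)  # 0.5
```

Swapping `fake_generate` for a wrapper around a real inference client is the only change needed to score an actual model. A response with no "Answer:" marker counts as incorrect, which also penalizes models that ignore the output format.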

System prompt effects:

Some models respond better to reasoning prompts with explicit instructions:

You are a careful reasoner. Think through each step explicitly.
Show your work before giving the final answer.

Test with and without system prompts. The difference in accuracy can be 10-20% on reasoning tasks.
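The with/without comparison can be sketched as a small A/B run over the same problems. Everything here is illustrative: `generate(question, system_prompt)` is an assumed callable signature, not a real client API, and the stub's canned behavior only exercises the code; it says nothing about how a real model responds.

```python
import re

SYSTEM_PROMPT = (
    "You are a careful reasoner. Think through each step explicitly.\n"
    "Show your work before giving the final answer."
)


def accuracy(generate, problems, system_prompt=None):
    """Fraction of problems answered correctly under one prompt condition."""
    correct = 0
    for question, expected in problems:
        response = generate(question, system_prompt)
        match = re.search(r"Answer:\s*(-?\d+)", response)
        if match and int(match.group(1)) == expected:
            correct += 1
    return correct / len(problems)


def compare_system_prompt(generate, problems, system_prompt=SYSTEM_PROMPT):
    """Run the same problems with and without the system prompt."""
    baseline = accuracy(generate, problems)
    prompted = accuracy(generate, problems, system_prompt)
    return {"baseline": baseline, "with_system_prompt": prompted, "delta": prompted - baseline}


# Stub for demonstration only: it "solves" the second problem
# only when a system prompt is present
def fake_generate(question, system_prompt=None):
    if "7 + 5" in question:
        return "Answer: 12"
    if "3 * 4 - 2" in question and system_prompt:
        return "3 * 4 = 12, then 12 - 2 = 10. Answer: 10"
    return "Answer: 0"


problems = [("What is 7 + 5?", 12), ("What is 3 * 4 - 2?", 10)]
print(compare_system_prompt(fake_generate, problems))
# {'baseline': 0.5, 'with_system_prompt': 1.0, 'delta': 0.5}
```

Keep the decoding settings identical across both conditions so that the system prompt is the only variable.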

EXERCISE

Create a 10-problem reasoning benchmark with 3-5 step problems. Test 3 models and analyze whether MMLU scores predict reasoning performance on your benchmark.
