Model Selection for Reasoning — Understanding AI Models (Chapter 15)

Reasoning tasks (math, logic puzzles, multi-step planning) stress models differently than knowledge retrieval. This chapter covers selection criteria for reasoning-heavy workloads.

Reasoning model requirements:

Chain-of-thought: Maintain coherent reasoning across many steps
Error recovery: Catch and correct mistakes mid-reasoning
Working memory: Track multiple intermediate conclusions
Verification: Check work against constraints

Benchmark comparison:

Model	MMLU	GSM8K	MATH
Phi-3-medium	82%	78%	53%
Mistral 7B	71%	52%	25%
Llama 3.1 8B	68%	55%	35%
Gemma 2 9B	74%	65%	41%

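These numbers show that MMLU does not reliably predict reasoning performance. Mistral 7B beats Llama 3.1 8B on MMLU (71% vs. 68%) yet trails it on GSM8K (52% vs. 55%) and MATH (25% vs. 35%). Phi-3-medium's strong MATH score plausibly reflects training on reasoning chains rather than knowledge alone.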
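One way to check how well MMLU tracks the reasoning benchmarks is to look for model pairs whose ranking flips between the two. The sketch below copies the scores from the table above. The function name `discordant_pairs` is illustrative only and does not come from any benchmark tooling.

```python
from itertools import combinations

# Scores copied from the benchmark table (percent).
SCORES = {
    "Phi-3-medium": {"MMLU": 82, "GSM8K": 78, "MATH": 53},
    "Mistral 7B": {"MMLU": 71, "GSM8K": 52, "MATH": 25},
    "Llama 3.1 8B": {"MMLU": 68, "GSM8K": 55, "MATH": 35},
    "Gemma 2 9B": {"MMLU": 74, "GSM8K": 65, "MATH": 41},
}


def discordant_pairs(scores, knowledge="MMLU", reasoning="MATH"):
    """Return model pairs whose ordering flips between two benchmarks."""
    flips = []
    for a, b in combinations(scores, 2):
        k_diff = scores[a][knowledge] - scores[b][knowledge]
        r_diff = scores[a][reasoning] - scores[b][reasoning]
        # Opposite signs mean the higher-MMLU model is the weaker reasoner.
        if k_diff * r_diff < 0:
            flips.append((a, b))
    return flips


print(discordant_pairs(SCORES))
print(discordant_pairs(SCORES, reasoning="GSM8K"))
```

With only four models this is a sanity check, not a statistical test. A single flipped pair still shows that ranking candidates by MMLU alone can pick the weaker reasoner.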

Architecture effects on reasoning:

Long context: Leaves room for long reasoning chains (a 128K-token window helps)
Attention patterns: Some models attend better to relevant context
Training data composition: Math- and code-heavy training data improves reasoning

Testing reasoning directly:

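import re

# Half of 8 is 4, plus 2 gives Bob 6; half of 6 gives Charlie 3.
APPLE_PROBLEM = (
    "Alice has 8 apples. She gives Bob 2 more than half her apples. "
    "Bob then gives Charlie half of what he received. "
    "How many apples does Charlie have?"
)

def extract_final_number(response):
    """Return the integer after the last "Answer:" marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(-?\d+)", response)
    return int(matches[-1]) if matches else None

def test_reasoning(model, problem=APPLE_PROBLEM, expected=3):
    # Multi-step problem with verifiable answer
    prompt = (
        f"Problem: {problem}\n\n"
        'Think step by step. Show your work. End with "Answer: X"'
    )

    # Assumes model.generate(prompt) returns the completion as a string
    response = model.generate(prompt)

    # Check for reasoning steps (a crude keyword heuristic)
    has_steps = "step" in response.lower() or "first" in response.lower()

    # Extract answer
    answer = extract_final_number(response)

    return {
        "has_reasoning": has_steps,
        "answer": answer,
        "correct": answer == expected
    }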

System prompt effects:

Some models respond better to reasoning prompts with explicit instructions:

You are a careful reasoner. Think through each step explicitly.
Show your work before giving the final answer.

Test with and without system prompts; the difference can be 10-20% on reasoning tasks.
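The comparison can be run as a simple A/B test over a set of problems with known answers. The sketch below is a minimal version under stated assumptions: `generate(question, system_prompt)` is a hypothetical callable that wraps whichever model API is in use and returns the response text, and answers are assumed to end with an "Answer: X" line holding an integer.

```python
import re

SYSTEM_PROMPT = (
    "You are a careful reasoner. Think through each step explicitly.\n"
    "Show your work before giving the final answer."
)


def final_answer(response):
    """Return the integer after the last 'Answer:' marker, or None."""
    found = re.findall(r"Answer:\s*(-?\d+)", response)
    return int(found[-1]) if found else None


def accuracy(generate, problems, system_prompt=None):
    """Fraction of problems answered correctly under one prompt condition."""
    correct = 0
    for question, expected in problems:
        # generate is a hypothetical wrapper around the model under test.
        response = generate(question, system_prompt)
        if final_answer(response) == expected:
            correct += 1
    return correct / len(problems)


def compare_system_prompt(generate, problems):
    """Accuracy without and with the system prompt, plus the difference."""
    baseline = accuracy(generate, problems)
    prompted = accuracy(generate, problems, SYSTEM_PROMPT)
    return {"baseline": baseline, "prompted": prompted, "delta": prompted - baseline}
```

Here `problems` is a list of `(question, expected_answer)` pairs. Use the same problems for both conditions, and enough of them that a gap of a few points is not just noise.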