15. Model Selection for Reasoning
Reasoning tasks-math, logic puzzles, multi-step planning-stress models differently than knowledge retrieval. This chapter covers selection criteria for reasoning-heavy workloads.
Reasoning model requirements:
- Chain-of-thought: Maintain coherent reasoning across many steps
- Error recovery: Catch and correct mistakes mid-reasoning
- Working memory: Track multiple intermediate conclusions
- Verification: Check work against constraints
Benchmark comparison:
| Model | MMLU | GSM8K | MATH |
|---|---|---|---|
| Phi-3-medium | 82% | 78% | 53% |
| Mistral 7B | 71% | 52% | 25% |
| Llama 3.1 8B | 68% | 55% | 35% |
| Gemma 2 9B | 74% | 65% | 41% |
These numbers show that MMLU does not predict reasoning performance-Phi-3-medium's strong MATH score comes from reasoning chain training, not pure knowledge.
Architecture effects on reasoning:
- Long context: Enables maintaining reasoning chains (128K helps)
- Attention patterns: Some models attend better to relevant context
- Training data composition: Math/code heavy training improves reasoning
Testing reasoning directly:
def test_reasoning(model, problem):
# Multi-step problem with verifiable answer
prompt = f"""
Problem: Alice has 5 apples. She gives Bob 2 more than half her apples.
Bob then gives Charlie half of what he received.
How many apples does Charlie have?
Think step by step. Show your work. End with "Answer: X"
"""
response = model.generate(prompt)
# Check for reasoning steps
has_steps = "step" in response.lower() or "first" in response.lower()
# Extract answer
answer = extract_final_number(response)
return {
"has_reasoning": has_steps,
"answer": answer,
"correct": answer == 3
}
System prompt effects:
Some models respond better to reasoning prompts with explicit instructions:
You are a careful reasoner. Think through each step explicitly.
Show your work before giving the final answer.
Test with and without system prompts-the difference can be 10-20% on reasoning tasks.
Create a 10-problem reasoning benchmark with 3-5 step problems. Test 3 models and analyze whether MMLU scores predict reasoning performance on your benchmark.