13. Multi-Step Reasoning
Single-step inference often breaks down on complex problems. Multi-step reasoning decomposes such a problem into intermediate steps, each of them manageable, so that the final answer emerges from the chain.
Why Single-Step Fails
Language models predict each token based on all previous tokens. For a complex problem like "If a train leaves New York at 60 mph and another leaves Chicago at 80 mph, heading toward each other, and they're 1000 miles apart, when do they meet?", a single-step approach requires the model to do all of the following in one pass:
- Encode all numerical values
- Calculate relative speed
- Calculate meeting time
- State the answer
An error in any of the first three steps produces a wrong answer with no recovery mechanism. Multi-step reasoning gives the model a chance to catch and correct errors at each intermediate step.
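As a rough illustration, the train problem can be worked one step at a time, with a cheap check after each step. This is only a sketch: it assumes the trains travel toward each other, and the variable names are made up for the example.

```python
# Worked version of the train problem, assuming the trains approach each other.
distance_miles = 1000.0
speed_a_mph = 60.0  # train leaving New York
speed_b_mph = 80.0  # train leaving Chicago

# Step 1: closing (relative) speed of two objects moving toward each other.
closing_speed_mph = speed_a_mph + speed_b_mph
# Check: the gap must shrink faster than either train moves alone.
assert closing_speed_mph > max(speed_a_mph, speed_b_mph)

# Step 2: time until the gap is closed.
meeting_time_h = distance_miles / closing_speed_mph

# Step 3: check the result by recomputing the distance both trains cover.
covered_miles = (speed_a_mph + speed_b_mph) * meeting_time_h
assert abs(covered_miles - distance_miles) < 1e-6

# Step 4: state the answer in hours and minutes.
hours = int(meeting_time_h)
minutes = round((meeting_time_h - hours) * 60)
print(f"The trains meet after about {hours} h {minutes} min")
```

If step 1 had subtracted the speeds instead of adding them, the first assertion would fail immediately, rather than surfacing only as a wrong final answer.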
The Decomposition Principle
Effective decomposition follows problem structure. For math problems, decomposition mirrors mathematical hierarchy:

    def decompose_math_problem(problem):
        """Decompose a math problem into ordered reasoning steps and solve it.

        The helper functions called here are placeholders for task-specific
        logic defined elsewhere.
        """
        steps = []

        # Level 1: Identify what's being asked
        steps.append(identify_target(problem))

        # Level 2: Extract relevant quantities
        quantities = extract_quantities(problem)
        steps.append(quantities)

        # Level 3: Identify relationships between quantities
        relationships = identify_relationships(quantities, problem)
        steps.append(relationships)

        # Level 4: Apply appropriate transformations
        transformations = plan_transformations(relationships)
        steps.append(transformations)

        # Level 5: Execute and combine
        result = execute_transformations(transformations)
        steps.append(result)

        # Return the intermediate steps along with the result so they can be inspected
        return {"steps": steps, "result": result}

Failure mode: Over-decomposition. Breaking problems into unnecessary micro-steps reduces coherence. The model loses track of logical connections between too many tiny steps. Start with 3-5 comprehensible steps per problem category, then refine based on error patterns.
Hierarchical vs. Sequential Reasoning
Multi-step reasoning can proceed hierarchically (generate sub-goals, then solve) or sequentially (solve one step, then the next):
| Approach | Strengths | Weaknesses |
|---|---|---|
| Hierarchical | Handles complex dependencies, clearer goal structure | Requires forward planning |
| Sequential | Simple to implement, easy to verify | Errors in early steps propagate to later ones |
| Hybrid | Combines planning and execution | Higher complexity |
DeepSeek R1's training favored sequential reasoning, with reinforcement learning discovering effective step patterns. This differs from chain-of-thought prompting, which explicitly instructs the model to produce steps: here the step structure is emergent from the training objective.
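The difference in control flow between the sequential and hierarchical rows of the table can be sketched on the train problem. This is a toy illustration only, not how any particular model is trained or prompted; `STEPS`, `solve_sequential`, and `solve_hierarchical` are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
StepFn = Callable[[State], State]

# Hypothetical step library for the train problem (trains approaching each other).
STEPS: Dict[str, StepFn] = {
    "closing_speed": lambda s: {**s, "closing_speed": s["speed_a"] + s["speed_b"]},
    "meeting_time": lambda s: {**s, "meeting_time": s["distance"] / s["closing_speed"]},
}

def solve_sequential(state: State, max_steps: int = 10) -> Tuple[State, List[str]]:
    """Choose the next step from the current state only; no plan exists up front."""
    trace: List[str] = []
    for _ in range(max_steps):
        if "meeting_time" in state:
            break
        name = "closing_speed" if "closing_speed" not in state else "meeting_time"
        state = STEPS[name](state)
        trace.append(name)
    return state, trace

def solve_hierarchical(state: State) -> Tuple[State, List[str]]:
    """Commit to an ordered list of sub-goals first, then solve each one in turn."""
    plan = ["closing_speed", "meeting_time"]  # forward planning happens here
    trace: List[str] = []
    for goal in plan:
        state = STEPS[goal](state)
        trace.append(goal)
    return state, trace

problem = {"distance": 1000.0, "speed_a": 60.0, "speed_b": 80.0}
seq_state, seq_trace = solve_sequential(dict(problem))
hier_state, hier_trace = solve_hierarchical(dict(problem))
print(seq_trace, hier_trace)
```

On a problem this small both approaches take the same steps. The difference shows up when the plan is wrong (the hierarchical solver follows it anyway) or when an early step is wrong (the sequential solver builds on it).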
Verification Between Steps
Multi-step reasoning isn't complete without verification between steps. Each intermediate result should be checked against constraints before proceeding:

    def multi_step_solve(problem, max_steps=10):
        """Generate reasoning steps one at a time, verifying each before applying it.

        `model` and the helper functions are assumed to be defined elsewhere.
        """
        results = []
        current_state = initialize_state(problem)

        for step in range(max_steps):
            next_step = model.generate_step(current_state, problem)
            results.append(next_step)

            # Verify step validity before proceeding
            if not verify_step(next_step, current_state, problem):
                # Flag inconsistency rather than proceeding
                return {"error": "step_invalid", "failed_step": step}

            current_state = apply_step(current_state, next_step)

            if is_terminal(current_state):
                return {"solution": current_state, "steps": results}

        return {"error": "exceeded_max_steps"}

This pattern limits error propagation: when a step fails verification, you know exactly where the reasoning broke down, instead of having to diagnose the failure from a wrong final answer.
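What a step verifier checks depends on the task. As one possible sketch for the train problem, the verifier below tests each proposed value against a constraint it must satisfy. The proposed steps are hard-coded stand-ins for model output, and `verify_step` and `run_chain` here are simplified, hypothetical variants with their own signatures.

```python
import math

PROBLEM = {"distance": 1000.0, "speed_a": 60.0, "speed_b": 80.0}

def verify_step(name, value, state, problem):
    """Return True if the proposed value satisfies the constraint for this step."""
    if not math.isfinite(value) or value <= 0:
        return False
    if name == "closing_speed":
        # Trains approaching each other close the gap faster than either moves alone.
        return value > max(problem["speed_a"], problem["speed_b"])
    if name == "meeting_time":
        # Time multiplied by closing speed must reproduce the starting distance.
        return "closing_speed" in state and math.isclose(
            value * state["closing_speed"], problem["distance"]
        )
    return False  # unknown step names are rejected

def run_chain(proposed_steps, problem):
    """Apply proposed (name, value) steps in order, stopping at the first invalid one."""
    state, accepted = {}, []
    for index, (name, value) in enumerate(proposed_steps):
        if not verify_step(name, value, state, problem):
            return {"error": "step_invalid", "failed_step": index, "steps": accepted}
        state[name] = value
        accepted.append(name)
    return {"solution": state, "steps": accepted}

# A correct chain, and one where the speeds were subtracted instead of added.
good = run_chain([("closing_speed", 140.0), ("meeting_time", 1000.0 / 140.0)], PROBLEM)
bad = run_chain([("closing_speed", 20.0), ("meeting_time", 50.0)], PROBLEM)
print(good)
print(bad)
```

The faulty chain is internally consistent (20 mph for 50 hours does cover 1000 miles), so a check on the final answer alone would pass it. The per-step constraint rejects it at step 0.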
When Multi-Step Reasoning Fails
Multi-step reasoning degrades in specific conditions:
- Attention collapse: In very long chains, later steps lose track of information established in early steps
- Confidence anchoring: Early errors lock in incorrect intermediate states that persist through remaining steps
- Dimensional confusion: Steps use inconsistent units or scales without noticing
For problems requiring more than about 15 reasoning steps, consider tool integration or hierarchical decomposition rather than extending the sequential chain.
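Of the failure conditions listed above, dimensional confusion is the easiest to guard against mechanically. One way to do it, sketched here with a hypothetical `Quantity` class rather than any particular units library, is to carry units alongside values so that an inconsistent step raises an error instead of silently producing a number.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Quantity:
    """A value with unit exponents, e.g. {"mile": 1, "hour": -1} for mph."""
    value: float
    units: Dict[str, int] = field(default_factory=dict)

    def __add__(self, other: "Quantity") -> "Quantity":
        # Addition only makes sense between quantities with identical units.
        if self.units != other.units:
            raise ValueError(f"cannot add {self.units} and {other.units}")
        return Quantity(self.value + other.value, dict(self.units))

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Division subtracts unit exponents and drops units that cancel out.
        units = dict(self.units)
        for unit, power in other.units.items():
            units[unit] = units.get(unit, 0) - power
        units = {u: p for u, p in units.items() if p != 0}
        return Quantity(self.value / other.value, units)

distance = Quantity(1000.0, {"mile": 1})
speed_a = Quantity(60.0, {"mile": 1, "hour": -1})
speed_b = Quantity(80.0, {"mile": 1, "hour": -1})

closing_speed = speed_a + speed_b         # miles per hour
meeting_time = distance / closing_speed   # miles cancel, leaving hours
print(meeting_time)

try:
    distance + speed_a                    # miles + miles per hour: inconsistent
except ValueError as exc:
    print("rejected:", exc)
```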
Count the reasoning steps in 10 problems from your deployment. Identify whether errors cluster in specific step positions (early, middle, late). If clustering exists, that step position likely requires different prompting or verification logic.
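A small script can do the tallying for this exercise. The sketch below assumes each logged run records the 0-based index of the first failed step (or `None` if the run succeeded) and the total number of steps. The sample data is invented purely to show the input format, and splitting chains into thirds is an arbitrary choice.

```python
from collections import Counter

def position_bucket(failed_step: int, total_steps: int) -> str:
    """Map a 0-based failed step index to the early, middle, or late third of the chain."""
    fraction = failed_step / total_steps
    if fraction < 1 / 3:
        return "early"
    if fraction < 2 / 3:
        return "middle"
    return "late"

def error_position_counts(runs):
    """Count failures per position bucket; runs is a list of (failed_step, total_steps)."""
    counts = Counter({"early": 0, "middle": 0, "late": 0})
    for failed_step, total_steps in runs:
        if failed_step is None:  # run succeeded, nothing to count
            continue
        counts[position_bucket(failed_step, total_steps)] += 1
    return counts

# Invented sample: 10 runs, 2 of which succeeded.
sample_runs = [
    (0, 6), (1, 6), (None, 5), (4, 5), (1, 8),
    (0, 4), (None, 7), (2, 9), (5, 6), (1, 5),
]
counts = error_position_counts(sample_runs)
print(dict(counts))
```

In this invented sample most failures land in the early third, which would point toward checking how quantities are extracted rather than how the final answer is assembled.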