10. Training Reasoning Models
Large language models do not reason innately; they are trained to reason. Understanding how reasoning capabilities emerge during training helps operators make better deployment decisions and diagnose performance issues.
The Reinforced Hard Problem
Traditional language model training optimizes next-token prediction across human-generated text. Reasoning tasks follow different dynamics: the model must commit to intermediate steps that don't directly predict the final answer. A chain-of-thought response requires the model to generate a sequence of explicit intermediate steps, each logically consistent with the ones before it.
The core training challenge is that reasoning traces are scarce. General text is available by the billions of tokens, but high-quality multi-step derivations, such as fully worked solutions to math problems, are rare. DeepSeek R1 addressed this with cold-start fine-tuning before reinforcement learning, a deliberate step to inject reasoning structure into the base model.
Distillation vs. End-to-End RL
Two primary training approaches exist for reasoning capabilities:
Distillation trains a smaller model on reasoning traces generated by a larger model. This approach is cheaper but inherits the teacher's limitations. The smaller model may also learn surface patterns in the reasoning chains rather than the underlying logic.
End-to-end reinforcement learning trains the model to discover effective reasoning strategies independently. DeepSeek R1 uses GRPO (Group Relative Policy Optimization), which generates multiple reasoning responses for each problem and updates the policy based on how each response scores relative to the others in its group, rather than on absolute reward values.
# Minimal GRPO-style update (conceptual)
import statistics

def grpo_update(model, problem, reward_fn, group_size=16):
    """One Group Relative Policy Optimization step (conceptual).

    compute_policy_loss is a placeholder for the training framework's
    loss routine; it is not defined here.
    """
    # Sample a group of responses for the same problem
    responses = [model.generate(problem) for _ in range(group_size)]
    rewards = [reward_fn(problem, r) for r in responses]

    # Normalize each reward against the group mean and standard deviation
    reward_mean = statistics.fmean(rewards)
    reward_std = statistics.pstdev(rewards)
    advantages = [(r - reward_mean) / (reward_std + 1e-8) for r in rewards]

    # The policy update favors responses with above-average advantage
    policy_loss = compute_policy_loss(model, responses, advantages)
    return policy_loss
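To make the normalization step concrete, here is a small worked example. It assumes a hypothetical group of four responses scored with a binary correctness reward (1.0 for correct, 0.0 for incorrect); the helper name `group_advantages` and the reward values are illustrative only.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards against their own group's mean and population std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: two correct responses, two incorrect ones
rewards = [1.0, 0.0, 0.0, 1.0]
print([round(a, 3) for a in group_advantages(rewards)])  # [1.0, -1.0, -1.0, 1.0]

# If every response earns the same reward, every advantage is zero
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```

The second case shows a practical consequence of group-relative scoring: a problem that the model always solves, or always fails, contributes no learning signal for that step.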
The Format Chaos Problem
Models trained purely with RL can develop erratic reasoning formats: mixing languages, inserting decorative symbols, or producing reasoning chains with no clear relationship to the answer. A related failure mode is reward hacking, where the model learns to maximize the reward function without producing genuinely useful reasoning.
DeepSeek R1 addressed format chaos by including format rewards (structured output) alongside outcome rewards (correctness). This is why R1 outputs typically wrap their reasoning in consistent markers such as "<think>" and "</think>" tags: the behavior is trained, not an architectural constraint.
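A combined reward of this kind might be sketched as follows. This is an illustration under stated assumptions, not DeepSeek's implementation: the `<think>` tag layout, the `format_weight` value, and the exact-match answer check are all hypothetical choices made for clarity.

```python
import re

# Assumed layout: reasoning inside <think>...</think>, followed by the answer
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed layout, else 0.0."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0

def extract_answer(response: str) -> str:
    """Text after the reasoning block, or the whole response if unformatted."""
    match = THINK_PATTERN.match(response.strip())
    return match.group(2).strip() if match else response.strip()

def outcome_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference."""
    return 1.0 if extract_answer(response) == reference.strip() else 0.0

def combined_reward(response: str, reference: str, format_weight: float = 0.2) -> float:
    """Outcome reward plus a smaller (hypothetical) weight on format."""
    return outcome_reward(response, reference) + format_weight * format_reward(response)

formatted_correct = "<think>3 boxes of 4 items is 3 * 4 = 12.</think> 12"
unformatted_correct = "12"
formatted_wrong = "<think>3 + 4 = 7, so maybe 13?</think> 13"

print(combined_reward(formatted_correct, "12"))    # 1.2
print(combined_reward(unformatted_correct, "12"))  # 1.0
print(combined_reward(formatted_wrong, "12"))      # 0.2
```

Keeping the format weight small relative to the outcome reward is one way to discourage the model from collecting format credit while ignoring correctness, though the right balance is an empirical question.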
What Operators Need to Monitor
During any reasoning model deployment, watch for:
- Reasoning length collapse: Model starts giving single-sentence responses after repeated queries
- Format drift: Reasoning markers disappear, model returns direct answers
- Consistency degradation: Model fails problems similar to ones it previously solved
These symptoms can indicate training instability or context saturation.
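The first two symptoms can be approximated with simple heuristics over logged responses. The sketch below assumes responses wrap their reasoning in `<think>` tags and uses hypothetical thresholds (`min_ratio`, `max_missing`); both are illustrative and would need tuning for a real deployment. Consistency degradation is not covered here, since it requires a held-out set of previously solved problems.

```python
import re
from statistics import fmean

# Assumed reasoning markers; adjust to the deployed model's actual format
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def reasoning_length(response: str) -> int:
    """Word count of the reasoning block; 0 if the markers are missing."""
    match = THINK_BLOCK.search(response)
    return len(match.group(1).split()) if match else 0

def check_responses(baseline, recent, min_ratio=0.5, max_missing=0.5):
    """Compare recent responses against a baseline sample.

    min_ratio and max_missing are hypothetical thresholds:
    - length_collapse: recent mean reasoning length falls below
      min_ratio times the baseline mean
    - format_drift: more than max_missing of recent responses lack markers
    """
    base_len = fmean([reasoning_length(r) for r in baseline])
    recent_len = fmean([reasoning_length(r) for r in recent])
    missing = sum(1 for r in recent if THINK_BLOCK.search(r) is None)
    return {
        "length_collapse": recent_len < min_ratio * base_len,
        "format_drift": missing / len(recent) > max_missing,
    }

baseline = ["<think>a b c d e f g h</think> 4", "<think>a b c d e f</think> 2"]
recent = ["<think>ok</think> 4", "4", "2"]
print(check_responses(baseline, recent))
# {'length_collapse': True, 'format_drift': True}
```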
Evaluate a reasoning model's response by checking three properties: (1) Is each reasoning step self-contained? (2) Do intermediate steps logically connect? (3) Does the final answer directly follow from the reasoning chain? Rate each property from 0 to 2 and sum the ratings; a combined score below 4 (out of 6) indicates a genuine reasoning problem.
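The rubric can be recorded in a small data structure so that ratings stay consistent across reviewers. This sketch assumes the threshold of 4 applies to the sum of the three ratings; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ReasoningScore:
    self_contained: int  # (1) each reasoning step is self-contained
    connected: int       # (2) intermediate steps logically connect
    answer_follows: int  # (3) final answer follows from the chain

    def __post_init__(self):
        # Each property is rated on the 0-2 scale described above
        for name, value in vars(self).items():
            if value not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2, got {value}")

    @property
    def total(self) -> int:
        return self.self_contained + self.connected + self.answer_follows

    @property
    def flagged(self) -> bool:
        """True when the combined score is below 4 out of 6."""
        return self.total < 4

score = ReasoningScore(self_contained=2, connected=1, answer_follows=0)
print(score.total, score.flagged)  # 3 True
```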