Training Reasoning Models — DeepSeek R1 and Reasoning Models (Chapter 10)

Large language models do not reason innately; they are trained to reason. Understanding how reasoning capabilities emerge during training helps operators make better deployment decisions and diagnose performance issues.

The Reinforced Hard Problem

Traditional language model training optimizes next-token prediction across human-generated text. Reasoning tasks follow different dynamics: the model must commit to intermediate steps that don't directly predict the final answer. A chain-of-thought response requires the model to generate logically consistent tokens that serve as explicit cognitive steps.

The core training challenge is that reasoning traces are scarce. Billions of tokens of general text exist, but high-quality multi-step derivations for something like a math problem are rare. DeepSeek R1 addressed this with a cold-start fine-tuning stage before reinforcement learning, a deliberate step to inject reasoning structure into the base model.

Distillation vs. End-to-End RL

Two primary training approaches exist for reasoning capabilities:

Distillation trains a smaller model on reasoning traces generated by a larger model. This approach is cheaper but inherits the teacher's limitations. The smaller model may also learn surface patterns in the reasoning chains rather than the underlying logic.
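The data-collection side of distillation can be sketched as follows. This is a minimal illustration, not any particular lab's pipeline: `teacher_generate` and `extract_answer` are hypothetical callables supplied by the caller, and filtering traces by answer correctness is an assumption added here to show one common way of keeping only usable traces.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    problems: List[Dict[str, str]],
    teacher_generate: Callable[[str], str],  # hypothetical: returns reasoning trace + answer
    extract_answer: Callable[[str], str],    # hypothetical: pulls the final answer from a trace
) -> List[Dict[str, str]]:
    """Collect teacher traces whose final answer matches the reference answer."""
    dataset = []
    for item in problems:
        trace = teacher_generate(item["question"])
        # Assumption: keep only traces that end in the correct answer
        if extract_answer(trace).strip() == item["answer"].strip():
            dataset.append({"prompt": item["question"], "completion": trace})
    return dataset


# Toy stand-ins so the sketch runs without a real teacher model
def _toy_teacher(question: str) -> str:
    canned = {
        "2+2": "Add the two numbers. Answer: 4",
        "3*3": "Multiply the numbers. Answer: 6",  # deliberately wrong
    }
    return canned[question]


def _toy_extract(trace: str) -> str:
    return trace.rsplit("Answer:", 1)[-1]


toy_problems = [
    {"question": "2+2", "answer": "4"},
    {"question": "3*3", "answer": "9"},
]
distilled = build_distillation_set(toy_problems, _toy_teacher, _toy_extract)
print(len(distilled))  # only the correct trace survives the filter
```

The student is then fine-tuned on the `prompt`/`completion` pairs with ordinary supervised learning, which is why it can only be as good as the traces the teacher produced.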

End-to-end reinforcement learning trains the model to discover effective reasoning strategies on its own. DeepSeek R1 uses GRPO (Group Relative Policy Optimization), which samples multiple reasoning responses for each problem and updates the policy based on how each response scores relative to the rest of its group rather than on absolute rewards.

# Minimal GRPO-style update (conceptual sketch)
# `model.generate` and `policy_loss_fn` are placeholders supplied by the caller;
# they stand in for a real sampling routine and a real policy-gradient loss.
def grpo_update(model, problem, reward_fn, policy_loss_fn, group_size=16):
    """One Group Relative Policy Optimization step for a single problem."""
    # Sample a group of candidate responses for the same problem
    responses = [model.generate(problem) for _ in range(group_size)]

    # Score every response in the group
    rewards = [reward_fn(problem, r) for r in responses]

    # Normalize each reward against the group's mean and standard deviation
    reward_mean = sum(rewards) / len(rewards)
    reward_std = (sum((r - reward_mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    advantages = [(r - reward_mean) / (reward_std + 1e-8) for r in rewards]

    # The policy loss weights each response by its group-relative advantage
    return policy_loss_fn(model, responses, advantages)
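
To make the normalization step concrete, here is a small worked example of the group-relative advantage on its own. The reward values are made up for illustration, and the helper name `group_relative_advantages` is hypothetical; only the arithmetic mirrors the step described above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one problem: two correct (1.0), two incorrect (0.0)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # approximately [1, -1, -1, 1]

# If every response earns the same reward, all advantages are zero
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)  # [0.0, 0.0, 0.0]
```

The second case matters in practice: a group in which every response scores the same carries no learning signal, so problems that are far too easy or far too hard for the current policy contribute nothing to the update.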

The Format Chaos Problem

When trained purely with RL, models can develop erratic reasoning formats: mixing languages, inserting decorative symbols, or producing reasoning chains with no clear relationship to the answer. A related failure mode is reward hacking, where the model learns to maximize the reward function without producing genuinely useful reasoning.

DeepSeek R1 addressed format chaos by including format rewards (structured output) alongside outcome rewards (correctness). This is why R1 outputs typically follow consistent markers, such as wrapping the reasoning in <think> ... </think> tags ahead of the final answer. That consistency is trained behavior, not an architectural constraint.
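
One way to combine the two reward types is sketched below. It is an illustration under stated assumptions, not DeepSeek's actual reward code: the `<think>` tag convention, the exact-match correctness check, and the `format_weight` value are all choices made here for the example.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>, followed by the final answer
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the expected structure, else 0.0."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def outcome_reward(response: str, reference: str) -> float:
    """1.0 if the final answer exactly matches the reference, else 0.0."""
    match = THINK_PATTERN.match(response.strip())
    final = match.group(2) if match else response
    return 1.0 if final.strip() == reference.strip() else 0.0


def total_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Correctness plus a weighted bonus for well-formed output (weight is hypothetical)."""
    return outcome_reward(response, reference) + format_weight * format_reward(response)


print(total_reward("<think>2+2=4</think>4", "4"))  # 1.5: correct and well formed
print(total_reward("4", "4"))                      # 1.0: correct, no reasoning markers
print(total_reward("<think>hmm</think>5", "4"))    # 0.5: well formed but wrong
```

Keeping the format bonus smaller than the correctness reward is a deliberate choice in this sketch: a well-formatted wrong answer should never outscore a correct one.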

What Operators Need to Monitor

During any reasoning model deployment, watch for:

Reasoning length collapse: Model starts giving single-sentence responses after repeated queries
Format drift: Reasoning markers disappear, model returns direct answers
Consistency degradation: Model fails problems similar to ones it previously solved

These symptoms can indicate context saturation or, if the model is still being fine-tuned, training instability.
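
The first two checks can be automated with simple statistics over logged responses. The sketch below assumes reasoning is delimited by `<think>` tags and compares a recent window against a baseline window; the function names and thresholds (`length_ratio`, `marker_drop`) are hypothetical and would need tuning against real traffic.

```python
import re
from statistics import mean

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def reasoning_stats(responses):
    """Return (mean reasoning length in words, fraction of responses with markers)."""
    lengths, marked = [], 0
    for text in responses:
        match = THINK_BLOCK.search(text)
        if match:
            marked += 1
            lengths.append(len(match.group(1).split()))
        else:
            lengths.append(0)  # no markers counts as zero reasoning
    return mean(lengths), marked / len(responses)


def check_drift(baseline, recent, length_ratio=0.5, marker_drop=0.2):
    """Compare a recent window of responses against a baseline window."""
    base_len, base_fmt = reasoning_stats(baseline)
    cur_len, cur_fmt = reasoning_stats(recent)
    alerts = []
    if base_len > 0 and cur_len < length_ratio * base_len:
        alerts.append("reasoning_length_collapse")
    if base_fmt - cur_fmt > marker_drop:
        alerts.append("format_drift")
    return alerts


baseline = ["<think>a b c d e f</think>42"] * 4
recent = ["<think>a</think>42", "42", "42", "42"]
print(check_drift(baseline, recent))    # both alerts fire
print(check_drift(baseline, baseline))  # []
```

Consistency degradation is harder to detect from response shape alone; it generally requires replaying a fixed set of problems with known answers and tracking accuracy over time.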