RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
DeepSeek R1 and Reasoning Models

10. Training Reasoning Models

Chapter 10 of 18 · 15 min
KEY INSIGHT

Reasoning capabilities don't emerge from next-token prediction alone. They require deliberate training that rewards intermediate steps and structured output, not just final correctness.

Large language models don't reason innately; they're trained to reason. Understanding how reasoning capabilities emerge during training helps operators make better deployment decisions and diagnose performance issues.

The Reinforced Hard Problem

Traditional language model training optimizes next-token prediction across human-generated text. Reasoning tasks follow different dynamics: the model must commit to intermediate steps that don't directly predict the final answer. A chain-of-thought response requires the model to generate logically consistent tokens that serve as explicit cognitive steps.

The core training challenge is that reasoning traces are sparse. General text is available by the billions of tokens, but high-quality multi-step derivations, such as a fully worked math solution, are rare. DeepSeek R1 addressed this with cold-start fine-tuning before reinforcement learning, a deliberate step that injects reasoning structure into the base model.

Distillation vs. End-to-End RL

Two primary training approaches exist for reasoning capabilities:

Distillation trains a smaller model on reasoning traces generated by a larger model. This approach is cheaper but inherits the teacher's limitations: the smaller model can end up learning surface patterns in the reasoning chains rather than the underlying logic.
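
As a rough illustration of the distillation setup, teacher traces can be turned into prompt/completion pairs for supervised fine-tuning. This is a minimal sketch, not DeepSeek's pipeline: `teacher_generate`, `fake_teacher`, the record fields, and the "Answer:" layout are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    problems: List[str],
    teacher_generate: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Turn teacher reasoning traces into prompt/completion pairs for SFT.

    teacher_generate is a hypothetical callable that returns a dict with
    'reasoning' and 'answer' keys for a given problem.
    """
    examples = []
    for problem in problems:
        out = teacher_generate(problem)
        # The student is trained to reproduce the full trace and then the
        # answer, so it inherits whatever the teacher wrote, mistakes included.
        completion = f"{out['reasoning']}\nAnswer: {out['answer']}"
        examples.append({"prompt": problem, "completion": completion})
    return examples


def fake_teacher(problem: str) -> Dict[str, str]:
    # Stand-in for a large reasoning model; always returns the same trace.
    return {"reasoning": "2 + 2 means adding two and two.", "answer": "4"}


dataset = build_distillation_set(["What is 2 + 2?"], fake_teacher)
print(dataset[0]["completion"])
```

Nothing in this loop checks whether the teacher's trace is correct, which is one concrete way the teacher's limitations pass through to the student.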

End-to-end reinforcement learning trains the model to discover effective reasoning strategies independently. DeepSeek R1 uses GRPO (Group Relative Policy Optimization), which generates multiple reasoning responses for each problem and updates the policy based on relative quality rather than absolute rewards.

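# Minimal GRPO-style update (conceptual; omits the clipped ratio and KL penalty)
def compute_policy_loss(model, problem, responses, advantages):
    """Advantage-weighted negative log-likelihood, averaged over the group.

    Assumes model.log_prob(problem, response) returns the summed token
    log-probability of the response (hypothetical interface).
    """
    terms = [-adv * model.log_prob(problem, resp)
             for resp, adv in zip(responses, advantages)]
    return sum(terms) / len(terms)


def grpo_update(model, problem, reward_fn, group_size=16):
    """One Group Relative Policy Optimization step; returns the loss."""
    # Sample a group of responses for the same problem
    responses = [model.generate(problem) for _ in range(group_size)]
    rewards = [reward_fn(problem, r) for r in responses]

    # Normalize each reward against the group mean and standard deviation
    reward_mean = sum(rewards) / len(rewards)
    reward_std = (sum((r - reward_mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    advantages = [(r - reward_mean) / (reward_std + 1e-8) for r in rewards]

    # Responses above the group mean get positive weight, those below negative
    return compute_policy_loss(model, problem, responses, advantages)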
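
To make the group-relative normalization concrete, here is a small worked example with made-up rewards. The function name `group_advantages` and the reward values are illustrative assumptions; only the mean and standard-deviation normalization mirrors the step described above.

```python
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards against the mean and population std of their group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one problem: two correct (1.0), two wrong (0.0).
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in mixed])  # [1.0, -1.0, -1.0, 1.0]

# If every sample earns the same reward, there is no relative signal.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)  # [0.0, 0.0, 0.0, 0.0]
```

The second case shows a property of this normalization worth keeping in mind: a problem that every sample solves (or every sample fails) yields zero advantages and so contributes nothing to the update.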

The Format Chaos Problem

When trained purely with RL, models can develop erratic reasoning formats: mixing languages, inserting decorative symbols, or producing reasoning chains with no clear relationship to the answer. A common failure mode is reward hacking, where the model learns to maximize the reward function without producing genuinely useful reasoning.

DeepSeek R1 addressed format chaos by including format rewards (structured output) alongside outcome rewards (correctness). This is why R1 outputs typically wrap their reasoning in consistent markers such as <think> and </think> tags. That is trained behavior, not an architectural constraint.
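
A combined reward of this kind might look like the following sketch. The `<think>`/`</think>` tag convention matches R1's published output format, but the regular expression, the exact-match answer check, and the 0.2 format weight are assumptions chosen for illustration, not DeepSeek's values.

```python
import re

# Reasoning inside <think>...</think>, followed by a non-empty final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the expected layout, else 0.0."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def final_answer(response: str) -> str:
    """Text after the last </think> tag, or the whole response if absent."""
    return response.split("</think>")[-1].strip()


def outcome_reward(response: str, expected: str) -> float:
    """1.0 if the final answer equals the expected answer exactly."""
    return 1.0 if final_answer(response) == expected.strip() else 0.0


def total_reward(response: str, expected: str, format_weight: float = 0.2) -> float:
    """Correctness plus a smaller bonus for well-formed output."""
    return outcome_reward(response, expected) + format_weight * format_reward(response)


good = "<think>3 boxes of 4 items is 3 * 4 = 12.</think>12"
unformatted = "12"
wrong = "<think>3 + 4 = 7.</think>7"

print(total_reward(good, "12"))         # 1.2
print(total_reward(unformatted, "12"))  # 1.0
print(total_reward(wrong, "12"))        # 0.2
```

Keeping the format weight well below the outcome reward is a design choice in this sketch: a well-formatted wrong answer should never outscore a correct one, which limits one obvious route to reward hacking.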

What Operators Need to Monitor

During any reasoning model deployment, watch for:

  • Reasoning length collapse: Model starts giving single-sentence responses after repeated queries
  • Format drift: Reasoning markers disappear, model returns direct answers
  • Consistency degradation: Model fails problems similar to ones it previously solved

These symptoms can indicate context saturation or, when they appear in a freshly trained or fine-tuned checkpoint, training instability.
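
The first two symptoms can be tracked with simple counters over logged responses. This is a rough sketch: the `<think>` tag check, the whitespace-token length proxy, the function names, and both thresholds are assumptions that would need tuning against a real deployment's baseline.

```python
from statistics import mean
from typing import Dict, List


def reasoning_length(response: str) -> int:
    """Whitespace-token count of the text between <think> and </think> (0 if absent)."""
    if "<think>" not in response or "</think>" not in response:
        return 0
    body = response.split("<think>", 1)[1].split("</think>", 1)[0]
    return len(body.split())


def health_report(
    baseline: List[str],
    recent: List[str],
    min_length_ratio: float = 0.5,
    min_format_rate: float = 0.9,
) -> Dict[str, bool]:
    """Flag length collapse and format drift in recent responses vs. a baseline."""
    base_len = mean([reasoning_length(r) for r in baseline])
    recent_len = mean([reasoning_length(r) for r in recent])
    # A response counts as well-formed only if it has non-empty reasoning tags.
    format_rate = mean([1.0 if reasoning_length(r) > 0 else 0.0 for r in recent])
    return {
        "length_collapse": recent_len < min_length_ratio * base_len,
        "format_drift": format_rate < min_format_rate,
    }


baseline = ["<think>" + "step " * 40 + "</think>42"] * 5
recent = ["<think>short</think>42", "42", "42", "<think>one step</think>42"]

print(health_report(baseline, recent))    # {'length_collapse': True, 'format_drift': True}
print(health_report(baseline, baseline))  # {'length_collapse': False, 'format_drift': False}
```

Consistency degradation is not covered here; catching it would require re-running a fixed set of probe problems over time and comparing pass rates.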

EXERCISE

Evaluate a reasoning model's response by checking three properties: (1) Is each reasoning step self-contained? (2) Do intermediate steps logically connect? (3) Does the final answer directly follow from the reasoning chain? Rate each property 0-2 for a total out of 6; totals below 4 indicate genuine reasoning problems.
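
The rubric can be recorded with a small helper so that scores stay comparable across responses. The class and field names are hypothetical, and the sketch assumes that "below 4" refers to the summed score out of 6.

```python
from dataclasses import dataclass


@dataclass
class ReasoningScore:
    self_contained: int   # (1) each step stands on its own, 0-2
    steps_connect: int    # (2) intermediate steps follow logically, 0-2
    answer_follows: int   # (3) final answer follows from the chain, 0-2

    def __post_init__(self) -> None:
        # Reject ratings outside the 0-2 scale used by the exercise.
        for name, value in vars(self).items():
            if value not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2, got {value}")

    @property
    def total(self) -> int:
        return self.self_contained + self.steps_connect + self.answer_follows

    @property
    def flagged(self) -> bool:
        """True when the summed score is below 4 out of 6."""
        return self.total < 4


score = ReasoningScore(self_contained=2, steps_connect=1, answer_follows=0)
print(score.total, score.flagged)  # 3 True
```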
