04. Chain-of-Thought in Reasoning
Chain-of-thought (CoT) prompting was one of the first techniques to show that language models could reason step by step when asked. Reasoning models like R1 have internalized this capability, but operators still influence how well it manifests. This chapter covers the mechanics of CoT and how to work with R1's native reasoning.
The Discovery of CoT
Standard prompting asks a model to produce the answer directly. Chain-of-thought prompting asks it to "think step by step" first. The technique emerged empirically: researchers noticed that eliciting intermediate steps improved accuracy on math and logic problems. The hypothesis is that writing out explicit steps externalizes the reasoning process, which reduces how much the model has to keep track of implicitly.
# Direct prompting (baseline)
prompt = "What is 17 * 23?"
# Without intermediate steps, the model may return an incorrect product
# CoT prompting
prompt = """
What is 17 * 23?
Think step by step.
"""
# Model shows work: 17 * 20 = 340, 17 * 3 = 51, total = 391
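The decomposition in the comment above can be reproduced mechanically. The following is a minimal sketch rather than anything a model runs internally: `decompose_product` is a hypothetical helper that splits the second factor by place value, which is one form the intermediate steps of a multiplication can take.

```python
def decompose_product(a: int, b: int) -> tuple[list[str], int]:
    """Split a * b into partial products by the decimal places of b.

    Hypothetical helper for illustration; assumes b is a non-negative integer.
    """
    parts = []
    place = 1
    # Collect the place-value parts of b, e.g. 23 -> [3, 20]
    while b > 0:
        digit = b % 10
        if digit:
            parts.append(digit * place)
        b //= 10
        place *= 10

    steps = []
    total = 0
    # Emit the largest part first, the way the steps are usually written
    for part in reversed(parts):
        partial = a * part
        steps.append(f"{a} * {part} = {partial}")
        total += partial
    return steps, total


steps, total = decompose_product(17, 23)
print(steps)  # ['17 * 20 = 340', '17 * 3 = 51']
print(total)  # 391
```

Each partial product is small enough to check on its own, which is the property that makes written-out steps easier to get right than a single direct answer.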
From Prompting to Internalized Behavior
R1 was trained with reinforcement learning (RL) to internalize CoT behavior. Rather than relying on a prompt to trigger step-by-step reasoning, the model has learned to do this autonomously. When you send a complex problem to R1, it generates reasoning tokens without being explicitly asked to "think step by step."
This has practical implications:
- Short prompts work; you don't need elaborate CoT scaffolding
- Excessive prompting can interfere with native reasoning
- You can still guide reasoning direction through prompt structure
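One way to apply these points is to limit the prompt to the problem itself plus, at most, a brief note on direction and output format. The sketch below is illustrative only: `build_reasoning_prompt` and its parameters are hypothetical names, and the wording of the hints is an assumption, not a tested template.

```python
from typing import Optional


def build_reasoning_prompt(problem: str,
                           focus: Optional[str] = None,
                           answer_format: Optional[str] = None) -> str:
    """Assemble a short prompt with no explicit step-by-step scaffolding."""
    sections = [problem.strip()]
    # Optional hint about which aspect of the problem to concentrate on
    if focus:
        sections.append(f"Focus on: {focus.strip()}")
    # Optional hint about the shape of the final answer
    if answer_format:
        sections.append(f"Give the final answer as: {answer_format.strip()}")
    return "\n\n".join(sections)


print(build_reasoning_prompt(
    "A train leaves at 14:10 and arrives at 16:45. How long is the trip?",
    answer_format="hours and minutes",
))
```

The function deliberately never adds a "think step by step" instruction. Any guidance goes through the structure of the prompt, leaving the reasoning itself to the model.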
Verifying Reasoning Quality
Because R1 exposes its reasoning chains, you can inspect them for errors before accepting an output. This is valuable for high-stakes applications where wrong answers have real costs.
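Before a chain can be checked, it has to be separated from the final answer. The sketch below assumes the raw completion wraps the reasoning in `<think>...</think>` tags, as the text output of R1's open-weight releases does. A hosted API may instead return the reasoning in a separate field, in which case this step is unnecessary. `split_reasoning` is a hypothetical helper name.

```python
import re

# Non-greedy match so only the first reasoning block is captured
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion.

    Assumes the reasoning is wrapped in <think>...</think>. If no such
    block is found, the reasoning is returned as an empty string.
    """
    match = THINK_PATTERN.search(raw_output)
    if match is None:
        return "", raw_output.strip()
    reasoning = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return reasoning, answer


raw = "<think>17 * 20 = 340, 17 * 3 = 51, so 391.</think>\nThe answer is 391."
reasoning, answer = split_reasoning(raw)
print(reasoning)  # 17 * 20 = 340, 17 * 3 = 51, so 391.
print(answer)     # The answer is 391.
```

The `reasoning` string is what a checker would receive as its `chain` argument.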
import re

# extract_math_expressions, verify_arithmetic, and has_contradiction are
# helper functions assumed to be defined elsewhere.

def verify_reasoning_chain(chain, problem_type):
    """Check a reasoning chain for common failure modes; return a list of issues."""
    issues = []
    # Check for a conclusion asserted without verification
    if re.search(r"Therefore,.*obviously", chain):
        issues.append("Skipped verification step")
    # Check for arithmetic errors (only if the problem involves math)
    if problem_type == "math":
        arithmetic_steps = extract_math_expressions(chain)
        for step in arithmetic_steps:
            if not verify_arithmetic(step):
                issues.append(f"Arithmetic error in: {step}")
    # Check for self-contradiction
    if has_contradiction(chain):
        issues.append("Reasoning chain self-contradicts")
    # An empty list means none of these checks fired, not that the chain is correct
    return issues
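The checker above calls three helpers that the chapter does not define. The following is one possible minimal sketch of them, under strong simplifying assumptions: only integer claims of the form `a op b = c` with `+`, `-`, or `*` are recognized, and contradiction detection is a crude heuristic that looks for a single-letter variable assigned two different integers. The implementations are illustrative, not a reference.

```python
import re

# Matches simple integer claims such as "17 * 20 = 340"
EXPR = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")


def extract_math_expressions(chain: str) -> list[str]:
    """Return each standalone 'a op b = c' claim found in the chain."""
    steps = []
    for m in EXPR.finditer(chain):
        before = chain[:m.start()].rstrip()
        # Skip the tail of a longer expression such as "2 + 3 * 4 = 14"
        if before and before[-1] in "+-*/":
            continue
        steps.append(m.group(0))
    return steps


def verify_arithmetic(step: str) -> bool:
    """Recompute one 'a op b = c' claim and compare it with the stated result."""
    m = EXPR.fullmatch(step.strip())
    if m is None:
        return False  # not in a form this sketch understands
    a, op, b, claimed = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    actual = {"+": a + b, "-": a - b, "*": a * b}[op]
    return actual == claimed


def has_contradiction(chain: str) -> bool:
    """Flag a chain that assigns two different integers to the same variable."""
    seen: dict[str, str] = {}
    for name, value in re.findall(r"\b([a-zA-Z])\s*=\s*(-?\d+)\b", chain):
        # setdefault stores the first value and returns it on later lookups
        if seen.setdefault(name, value) != value:
            return True
    return False


chain = "17 * 20 = 340, 17 * 3 = 52, total = 392"
for step in extract_math_expressions(chain):
    print(step, verify_arithmetic(step))
```

Pattern matching of this kind only catches surface-level slips. A chain that passes these checks can still be wrong, so they complement manual review and do not replace it.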
Process ten complex queries through R1 and manually inspect the reasoning chains. Categorize the failure modes you observe. Are there patterns that suggest specific prompting adjustments?
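For the categorization step, a simple tally is often enough to make patterns visible. The sketch below assumes you have already assigned one or more labels to each chain by hand. The function name and the labels are hypothetical, and the sample data is placeholder input, not an observed result.

```python
from collections import Counter


def tally_failure_modes(labels_per_chain: list[list[str]]) -> list[tuple[str, int]]:
    """Count how often each manually assigned label occurs, most common first."""
    counts = Counter(label for labels in labels_per_chain for label in labels)
    return counts.most_common()


# Placeholder labels for three inspected chains (an empty list means no issue found)
labels = [
    ["arithmetic error"],
    [],
    ["arithmetic error", "skipped verification"],
]
print(tally_failure_modes(labels))
# [('arithmetic error', 2), ('skipped verification', 1)]
```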