How to verify chain-of-thought reasoning in R1 models
Prerequisite: a DeepSeek-R1 model running locally (the commands below call an Ollama-style API at localhost:11434).
What this does
DeepSeek-R1 generates a chain-of-thought (CoT) before producing a final answer. This guide shows how to capture that reasoning trace, separate it from the final answer, and check both for correctness.
Steps
Capture the full response including reasoning tags.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "deepseek-r1:32b", "prompt": "Solve: 3x + 7 = 22", "stream": false, "raw": true}' \
  | jq -r '.response' > response.txt
Extract the reasoning between the tags. DeepSeek-R1 wraps its reasoning in <think>...</think>, so the script matches on those tags. Save it as extract_reasoning.py:
import re

# Read the raw model output captured in the previous step
with open("response.txt") as f:
    text = f.read()

# Reasoning sits between <think> and </think>; the answer follows the closing tag
reasoning = re.search(r'<think>(.*?)</think>', text, re.DOTALL)
answer = re.search(r'</think>\s*(.*)', text, re.DOTALL)

print("Reasoning steps:\n", reasoning.group(1).strip() if reasoning else "Not found")
print("Final answer:\n", answer.group(1).strip() if answer else "Not found")
Verify logical consistency. Check that each reasoning step follows from the previous one. For "3x + 7 = 22", valid steps are:
- Subtract 7 from both sides → 3x = 15
- Divide by 3 → x = 5
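For simple linear equations, the consistency check can be partly automated. The following is a minimal sketch, assuming each reasoning step has been transcribed by hand into coefficients (a, b, c) meaning a*x + b = c. The helper names solve_linear and steps_are_consistent are hypothetical and only illustrate the idea: every valid step must have the same solution as the original equation.

```python
from fractions import Fraction

def solve_linear(a, b, c):
    """Solve a*x + b = c for x exactly, using rational arithmetic."""
    if a == 0:
        raise ValueError("not a linear equation in x")
    return Fraction(c - b, a)

def steps_are_consistent(steps):
    """Return True if every (a, b, c) step has the same solution for x."""
    solutions = {solve_linear(a, b, c) for a, b, c in steps}
    return len(solutions) == 1

# Steps transcribed by hand from the reasoning trace for "3x + 7 = 22"
steps = [
    (3, 7, 22),  # original equation: 3x + 7 = 22
    (3, 0, 15),  # after subtracting 7 from both sides: 3x = 15
    (1, 0, 5),   # after dividing by 3: x = 5
]

print(steps_are_consistent(steps))  # True
print(solve_linear(3, 7, 22))       # 5
```

A step that changes the solution set, such as 3x = 16, makes steps_are_consistent return False. This only covers equations of this one shape; free-form reasoning still needs a manual read.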
Run a counterfactual test. Give an intentionally flawed premise to see if the model catches the issue:
"All birds can fly. Penguins are birds. Can penguins fly?"
A well-reasoned response should flag the false premise (not all birds can fly) rather than conclude that penguins can fly.
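A rough way to screen counterfactual responses in bulk is a keyword heuristic. The sketch below is an assumption-laden illustration, not a reliable classifier: the cue patterns in PREMISE_CUES and the function name flags_flawed_premise are hypothetical, are written for the penguin prompt only, and will miss responses that phrase the objection differently.

```python
import re

# Hypothetical cue patterns for the penguin prompt; tune them for your own tests.
PREMISE_CUES = [
    r"\bnot all birds\b",
    r"\bpremise\b.*\b(false|incorrect|flawed|wrong)\b",
    r"\bpenguins\b.*\b(cannot|can't|can not|do not|don't)\s+fly\b",
    r"\bflightless\b",
]

def flags_flawed_premise(response: str) -> bool:
    """Return True if the response appears to challenge the flawed premise."""
    text = response.lower()
    return any(re.search(pattern, text, re.DOTALL) for pattern in PREMISE_CUES)

good = "The premise is false: penguins are birds, but penguins cannot fly."
bad = "Yes. All birds can fly and penguins are birds, so penguins can fly."

print(flags_flawed_premise(good))  # True
print(flags_flawed_premise(bad))   # False
```

Treat a False result as a prompt to read the response by hand, not as proof that the model failed.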
Verification
# Expected: reasoning contains numbered steps, final answer matches ground truth
python extract_reasoning.py
# Output: Reasoning steps: 1. Subtract 7... 2. Divide by 3... | Final answer: x = 5
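The two expectations above (numbered steps present, final answer matches ground truth) can be turned into an automated check. This is a minimal sketch, assuming the reasoning is delimited by <think>...</think>; pass different open_tag and close_tag values if your output uses other delimiters. The function name verify_trace is hypothetical.

```python
import re

def verify_trace(text, expected_answer, open_tag="<think>", close_tag="</think>"):
    """Check a raw response for numbered reasoning steps and a matching answer."""
    reasoning, sep, answer = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat everything as the answer, with no reasoning
        reasoning, answer = "", text
    reasoning = reasoning.replace(open_tag, "")

    # A numbered step is a line starting with digits followed by "." or ")"
    has_steps = bool(re.search(r"^\s*\d+[.)]\s+", reasoning, re.MULTILINE))

    # Compare answers with whitespace and case removed (substring match)
    def normalize(s):
        return re.sub(r"\s+", "", s).lower()

    answer_ok = normalize(expected_answer) in normalize(answer)
    return {"has_numbered_steps": has_steps, "answer_matches": answer_ok}

sample = (
    "<think>\n1. Subtract 7 from both sides: 3x = 15\n"
    "2. Divide by 3: x = 5\n</think>\nx = 5"
)
print(verify_trace(sample, "x = 5"))
# {'has_numbered_steps': True, 'answer_matches': True}
```

The answer comparison is a substring match, so an expected "x = 5" would also match "x = 50"; tighten it if your answers can collide that way.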
Common failures
- Missing reasoning tags: Older or distilled R1 variants may not output structured tags. Use raw: true in the API call to see the full output.
- Reasoning contradicts answer: Indicates model confusion. Re-run with temperature: 0 for deterministic behavior.
- Truncated reasoning: Increase num_ctx to 16384 to accommodate long chains.
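The settings named in the list above can be combined into one request. This is a sketch under the assumption that the endpoint is an Ollama-compatible /api/generate, where sampling and context settings are passed in an "options" object. The function names build_request and generate are hypothetical, and generate needs a running server.

```python
import json
import urllib.request

def build_request(prompt, model="deepseek-r1:32b", host="http://localhost:11434"):
    """Build a POST request with raw output, temperature 0, and a larger context."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "raw": True,  # return the full output, including any reasoning tags
        "options": {
            "temperature": 0,  # reduce run-to-run variation
            "num_ctx": 16384,  # leave room for long reasoning chains
        },
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt):
    """Send the request and return the response text (requires a live server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

req = build_request("Solve: 3x + 7 = 22")
print(json.loads(req.data)["options"])
```

Building the request separately from sending it makes the payload easy to inspect or log before any model call is made.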
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.