What this does

Runs structured reasoning test prompts against DeepSeek-R1 and comparison models, evaluating chain-of-thought quality and final-answer accuracy. By the end of this guide you will have a side-by-side comparison of reasoning performance across models, based on the same prompts and settings.

Steps

Prepare a reasoning test set. Create a file with one problem per line, covering logic, math, and deduction.

```
cat > /tmp/reasoning_tests.txt << 'EOF'
A train leaves at 09:00 traveling 60 km/h. Another leaves at 10:30 traveling 90 km/h. When does the second catch the first?
Three switches control three light bulbs in another room. You can only enter once. How do you determine which switch controls which bulb?
If all Zogs are Mips, and some Mips are Rels, can we conclude some Zogs are Rels?
EOF
```
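Scoring final-answer accuracy later requires a reference answer for each prompt. As a minimal sketch, the train problem can be worked out with shell arithmetic, assuming both trains start from the same station and travel in the same direction (the prompt does not state this). The variable names are illustrative only.

```shell
# Reference answer for the train prompt.
# Assumption: same origin, same direction, constant speeds.
head_start_min=90                            # 09:00 to 10:30
gap_km=$(( 60 * head_start_min / 60 ))       # distance covered by the first train before 10:30
closing_kmh=$(( 90 - 60 ))                   # rate at which the second train closes the gap
catch_min=$(( gap_km * 60 / closing_kmh ))   # minutes after 10:30 until the gap is closed
t=$(( 10 * 60 + 30 + catch_min ))            # clock time in minutes since midnight
train_answer=$(printf '%02d:%02d' $(( t / 60 )) $(( t % 60 )))
echo "$train_answer"
```

For the other two prompts the reference answers are qualitative: the switch puzzle relies on bulb heat as a second signal, and the Zogs syllogism does not follow.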

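Run prompts against DeepSeek-R1 with extended context. Reasoning chains can be long, so raise the context window to 8192 tokens. ollama run does not accept a --num-ctx flag; in recent Ollama releases the context length is set on the server, so start (or restart) it with OLLAMA_CONTEXT_LENGTH=8192 ollama serve and run the prompt from another terminal.
```
ollama run deepseek-r1:7b "$(head -1 /tmp/reasoning_tests.txt)"
```
Expected output: the full response, with the visible thinking block followed by the final answer.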
Run the same prompt against comparison models. Use the identical prompt and matching settings; the context length comes from the server or model configuration, since ollama run takes no --num-ctx flag.
```
ollama run qwen2.5:7b "$(head -1 /tmp/reasoning_tests.txt)"
```
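One way to keep settings matched across models is to pin them in a small Modelfile per model. The sketch below writes one for each model used in this guide; the helper name write_modelfile and the /tmp paths are hypothetical, while the values (8192-token context, temperature 0.7) mirror the ones this guide mentions.

```shell
# write_modelfile BASE_MODEL OUT_FILE
# Emits a Modelfile that pins the shared benchmark settings for BASE_MODEL.
write_modelfile() {
  cat > "$2" << EOF
FROM $1
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
EOF
}

write_modelfile deepseek-r1:7b /tmp/Modelfile.r1-bench
write_modelfile qwen2.5:7b /tmp/Modelfile.qwen-bench
```

Each file can then be registered with ollama create, for example ollama create r1-bench -f /tmp/Modelfile.r1-bench, and the resulting name (r1-bench here is only an example) used in place of the base tag in ollama run.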
Compare the final answers and the structure of each reasoning path.
Score each response on reasoning validity, completeness, and final-answer accuracy, then repeat for every prompt in the test set.
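The accuracy part of the scoring can be partly automated. The following is a rough sketch, assuming responses were saved to files and that the thinking block is delimited by <think> and </think> tags (this depends on the model and Ollama version). The function names are hypothetical. Reasoning validity and completeness still need a human read.

```shell
# strip_thinking: read a saved response on stdin and drop every line from
# the one containing <think> through the one containing </think>.
strip_thinking() {
  awk '/<think>/ { skip = 1 } !skip { print } /<\/think>/ { skip = 0 }'
}

# contains_answer FILE EXPECTED: succeed if the text outside the thinking
# block contains EXPECTED as a literal substring.
contains_answer() {
  strip_thinking < "$1" | grep -qF -- "$2"
}
```

For example, contains_answer output.txt "13:30" checks a saved train-problem response against its reference answer. A substring match is deliberately crude; it will miss answers phrased as "1:30 pm", so treat a failure as a prompt to inspect the file rather than as a final score.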

Verification

```
ollama list | grep -E "deepseek-r1|qwen"
# Expected: Both model tags listed with size and date information
```

Common failures

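too many tokens - Chain-of-thought exceeds the context window; increase the context length (num_ctx) or use shorter test prompts.
truncated reasoning chain - Long outputs can be cut off; use ollama run ... > output.txt to capture the full response, and raise num_ctx or num_predict if generation itself stops early.
model not reasoning - Some models do not emit a visible chain-of-thought by default; check stop sequences, or ask for step-by-step reasoning in the prompt.
inconsistent temperature - ollama run has no --temp flag; set the same temperature (for example 0.7) for every model with /set parameter temperature 0.7 in an interactive session, PARAMETER temperature 0.7 in a Modelfile, or the options field of an API request.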
slow iteration - Running multiple 7B+ models sequentially is time-consuming; consider scripting the loop.
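
The loop mentioned in the last item could look like the sketch below. It assumes one prompt per line in the test file and writes each response to its own file; the function name run_all and the output naming scheme are hypothetical. The prompt file is read on file descriptor 3 so that the model command cannot consume the remaining prompts from standard input.

```shell
# run_all PROMPT_FILE OUT_DIR MODEL...
# Runs every non-empty line of PROMPT_FILE against every MODEL and saves
# each response to OUT_DIR/<model>_<prompt number>.txt.
run_all() {
  prompts=$1
  out_dir=$2
  shift 2
  mkdir -p "$out_dir"
  for model in "$@"; do
    n=0
    # Make the model tag safe to use in a file name.
    safe=$(printf '%s' "$model" | tr ':/' '__')
    while IFS= read -r prompt <&3; do
      [ -n "$prompt" ] || continue
      n=$(( n + 1 ))
      ollama run "$model" "$prompt" > "$out_dir/${safe}_$n.txt"
    done 3< "$prompts"
  done
}
```

A typical call would be run_all /tmp/reasoning_tests.txt /tmp/reasoning_out deepseek-r1:7b qwen2.5:7b, after which the saved files can be reviewed and scored side by side.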

How to benchmark reasoning capabilities between DeepSeek-R1 and other models
