How to benchmark reasoning capabilities between DeepSeek-R1 and other models
Prerequisites: DeepSeek-R1 pulled via Ollama, at least one comparison model installed, and reasoning test prompts prepared.
What this does
Runs structured reasoning test prompts against DeepSeek-R1 and comparison models, evaluating chain-of-thought quality and final-answer accuracy. By the end of this guide you will have a like-for-like comparison of reasoning performance across the models, based on identical prompts and settings.
Steps
Prepare a reasoning test set. Create a file with one problem per line, covering logic, math, and deduction.
cat > /tmp/reasoning_tests.txt << 'EOF'
A train leaves at 09:00 traveling 60 km/h. Another leaves at 10:30 traveling 90 km/h. When does the second catch the first?
Three switches control three light bulbs in another room. You can only enter once. How do you determine which switch controls which bulb?
If all Zogs are Mips, and some Mips are Rels, can we conclude some Zogs are Rels?
EOF
Run prompts against DeepSeek-R1 with extended context. Reasoning chains can be long, so raise the context window to 8192 tokens. Note that ollama run does not accept a --num-ctx flag; set num_ctx with /set parameter num_ctx 8192 in an interactive session, with a PARAMETER num_ctx 8192 line in a Modelfile, or through the options field of the API.
ollama run deepseek-r1:7b "$(head -1 /tmp/reasoning_tests.txt)"
Expected output: Full response including a visible thinking block.
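Run the same prompt against comparison models. Use the identical prompt and the same num_ctx value as for DeepSeek-R1, set through /set parameter, a Modelfile, or the API options field, since ollama run accepts no --num-ctx flag.
ollama run qwen2.5:7b "$(head -1 /tmp/reasoning_tests.txt)"
Compare the final answer and the structure of the reasoning path.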
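The single-prompt runs above can be extended to every prompt and every model with a small loop. The following is a minimal sketch, not a definitive harness: MODELS, PROMPTS, OUT_DIR, and run_all are illustrative names that do not come from this guide, and each model runs with whatever context length it is already configured for.

```shell
#!/usr/bin/env bash
# Sketch: run every prompt in the test file against every model, one output file each.
# MODELS, PROMPTS, OUT_DIR and run_all are illustrative names (assumptions).
MODELS="${MODELS:-deepseek-r1:7b qwen2.5:7b}"
PROMPTS="${PROMPTS:-/tmp/reasoning_tests.txt}"
OUT_DIR="${OUT_DIR:-/tmp/reasoning_runs}"

run_all() {
  mkdir -p "$OUT_DIR"
  n=0
  while IFS= read -r prompt; do
    [ -z "$prompt" ] && continue   # skip blank lines
    n=$((n + 1))
    for model in $MODELS; do
      # Replace ':' and '/' so the model tag is safe as a file name.
      safe=$(printf '%s' "$model" | tr ':/' '__')
      # </dev/null stops ollama from reading the rest of the prompt file on stdin.
      ollama run "$model" "$prompt" </dev/null > "$OUT_DIR/p${n}_${safe}.txt"
    done
  done < "$PROMPTS"
}

# run_all   # uncomment to start the benchmark
```

Each reply lands in a file named after the prompt number and model tag, so the outputs for one prompt can be read side by side.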
Score responses on reasoning validity, completeness, and accuracy. Repeat across all test prompts.
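Final-answer accuracy can be checked mechanically when a prompt has a single correct answer. For the train problem, the first train has a head start of 1.5 h × 60 km/h = 90 km, the gap closes at 90 − 60 = 30 km/h, so the second train catches up 3 hours after 10:30, at 13:30. The sketch below strips the reasoning block before checking, so that an answer mentioned only mid-reasoning does not count. It assumes the model prints its reasoning between <think> and </think> tags on their own lines, which may differ by Ollama version; strip_think and check_answer are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: strip the reasoning block and check the final answer against a key.
# strip_think and check_answer are illustrative helper names (assumptions).

# Drop every line from one containing <think> through one containing </think>.
strip_think() {
  sed '/<think>/,/<\/think>/d'
}

# check_answer FILE PATTERN -> prints "correct" or "incorrect"
check_answer() {
  if strip_think < "$1" | grep -Eq -- "$2"; then
    echo correct
  else
    echo incorrect
  fi
}

# Example for the train prompt, with the reply saved in output.txt:
# check_answer output.txt '13:30|1:30 ?[Pp]'
```

Reasoning validity and completeness still need a human reader; the switch and syllogism prompts in particular have answers that a simple pattern match cannot judge.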
Verification
ollama list | grep -E "deepseek-r1|qwen"
# Expected: Both model tags listed with size and date information
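The same check can be made to fail loudly in a script, which helps before a long unattended run. This is a sketch under the assumption that ollama list prints the model tag as the first whitespace-separated column of each row; require_models is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: return non-zero when any required model tag is not installed.
# require_models is an illustrative helper name (assumption).
require_models() {
  missing=0
  installed=$(ollama list)
  for tag in "$@"; do
    # Match the tag exactly against the first column of the listing.
    if ! printf '%s\n' "$installed" |
        awk -v t="$tag" '$1 == t { found = 1 } END { exit !found }'; then
      echo "missing: $tag" >&2
      missing=1
    fi
  done
  return "$missing"
}

# require_models deepseek-r1:7b qwen2.5:7b || exit 1
```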
Common failures
- too many tokens - Chain-of-thought exceeds the context window; increase num_ctx or use shorter test prompts.
- truncated reasoning chain - Ollama may cut long outputs; use ollama run ... > output.txt to capture the full response.
- model not reasoning - Some models do not expose chain-of-thought by default; check stop sequences.
- inconsistent temperature - Use the same temperature for every model, for example 0.7, to avoid run-to-run variability. ollama run has no --temp flag; set it with /set parameter temperature 0.7, a Modelfile PARAMETER line, or the API options field.
- slow iteration - Running multiple 7B+ models sequentially is time-consuming; consider scripting the loop.
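The context-window and temperature failures above can both be avoided by sending the settings explicitly with every request. The sketch below uses the options field of Ollama's /api/generate endpoint. It assumes a server on the default port 11434 and single-line prompts; json_escape, build_payload, generate, NUM_CTX, and TEMPERATURE are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: send a prompt through the REST API with explicit num_ctx and temperature.
# json_escape, build_payload and generate are illustrative helper names (assumptions).

json_escape() {
  # Escape backslashes first, then double quotes (enough for single-line prompts).
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

build_payload() {  # build_payload MODEL PROMPT
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%s,"temperature":%s}}' \
    "$1" "$(json_escape "$2")" "${NUM_CTX:-8192}" "${TEMPERATURE:-0.7}"
}

generate() {  # generate MODEL PROMPT -> prints the raw JSON reply
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}

# generate deepseek-r1:7b "$(head -1 /tmp/reasoning_tests.txt)" > output.json
```

Because the options travel with each request, every model is queried with the same context length and temperature regardless of its local defaults.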