RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
How to benchmark reasoning capabilities between DeepSeek-R1 and other models

advanced·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

DeepSeek-R1 pulled via Ollama, at least one comparison model installed, reasoning test prompts prepared

What this does

Runs structured reasoning test prompts against DeepSeek-R1 and comparison models, evaluating chain-of-thought quality and final-answer accuracy. By the end of this guide you will have a side-by-side comparison of reasoning performance, with every model scored on the same prompts under the same settings.

Steps

  1. Prepare a reasoning test set. Creates a file with diverse problems covering logic, math, and deduction.

    cat > /tmp/reasoning_tests.txt << 'EOF'
    A train leaves at 09:00 traveling 60 km/h. Another leaves at 10:30 traveling 90 km/h. When does the second catch the first?
    Three switches control three light bulbs in another room. You can only enter once. How do you determine which switch controls which bulb?
    If all Zogs are Mips, and some Mips are Rels, can we conclude some Zogs are Rels?
    EOF
    
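  2. Run prompts against DeepSeek-R1 with extended context. Reasoning chains can be long, so raise the context window. ollama run has no context-length flag; the setting is the num_ctx parameter, set here through a Modelfile variant (deepseek-r1-8k is just an example name).

    printf 'FROM deepseek-r1:7b\nPARAMETER num_ctx 8192\n' > /tmp/r1.Modelfile && ollama create deepseek-r1-8k -f /tmp/r1.Modelfile
    ollama run deepseek-r1-8k "$(head -1 /tmp/reasoning_tests.txt)"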
    

    Expected output: the full response, with a visible thinking block (DeepSeek-R1 wraps it in <think> and </think> tags) followed by the final answer.

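  3. Run the same prompt against comparison models. Uses the identical prompt and the same num_ctx value, set through a matching Modelfile variant (qwen2.5-8k is just an example name).

    printf 'FROM qwen2.5:7b\nPARAMETER num_ctx 8192\n' > /tmp/qwen.Modelfile && ollama create qwen2.5-8k -f /tmp/qwen.Modelfile
    ollama run qwen2.5-8k "$(head -1 /tmp/reasoning_tests.txt)"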
    

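    Compare the final answers and the structure of each reasoning path. For the train prompt the correct answer is 13:30: the first train's 90 km head start closes at 30 km/h, which takes 3 hours from 10:30.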

  4. Score responses on reasoning validity, completeness, and accuracy. Repeat across all test prompts.
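
Final-answer accuracy is the one part of step 4 that can be checked mechanically. The sketch below is one possible approach, not part of the original guide: it works out the reference answer for the train prompt, drops the thinking block, and checks whether the remaining text names the right time. The function names strip_think and score_train are hypothetical, and it assumes the <think> and </think> tags each sit on their own line, so answer text sharing a line with a tag would be dropped.

```shell
# Reference answer for the train prompt, worked in whole minutes.
head_start_km=$(( 60 * 90 / 60 ))               # 09:00 to 10:30 at 60 km/h = 90 km
catch_min=$(( head_start_km * 60 / (90 - 60) )) # gap closes at 30 km/h
t=$(( 10 * 60 + 30 + catch_min ))               # minutes after midnight
expected_24=$(printf '%02d:%02d' $(( t / 60 )) $(( t % 60 )))
expected_12=$(printf '%d:%02d' $(( (t / 60 + 11) % 12 + 1 )) $(( t % 60 )))

# Print a saved response without its <think>...</think> block.
# Assumption: each tag is on its own line.
strip_think() {
  awk '
    /<think>/   { in_think = 1 }
    !in_think   { print }
    /<\/think>/ { in_think = 0 }
  ' "$@"
}

# Print 1 if the final answer names the expected time (24-hour form,
# or 12-hour form followed by "pm"), else 0.
score_train() {
  local pattern="(^|[^0-9])(${expected_24}|${expected_12} ?p\.?m)"
  if strip_think "$1" | grep -Eiq "$pattern"; then
    echo 1
  else
    echo 0
  fi
}
```

Calling score_train on a saved response file prints 1 or 0. Reasoning validity and completeness still need a human read; a text match only says whether the stated answer is right, not whether the path to it was sound.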

Verification

ollama list | grep -E "deepseek-r1|qwen"
# Expected: Both model tags listed with size and date information
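
The grep above matches any tag in a model family, so deepseek-r1:1.5b would pass even when the guide expects deepseek-r1:7b. A stricter check, sketched here with a hypothetical helper named missing_models, compares exact tags against the first column of the ollama list output:

```shell
# Reads `ollama list` output on stdin and prints every requested tag
# that does not appear exactly in the first (NAME) column.
missing_models() {
  local listing m
  listing=$(cat)
  for m in "$@"; do
    printf '%s\n' "$listing" \
      | awk -v tag="$m" '$1 == tag { found = 1 } END { exit !found }' \
      || echo "$m"
  done
}
```

Usage: ollama list | missing_models deepseek-r1:7b qwen2.5:7b. No output means every listed tag is installed.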

Common failures

  • too many tokens - Chain-of-thought exceeds the context window; raise num_ctx (PARAMETER num_ctx in a Modelfile, or /set parameter num_ctx in an interactive session) or use shorter test prompts.
  • truncated reasoning chain - Ollama may cut long outputs; use ollama run ... > output.txt to capture full response.
  • model not reasoning - Some models do not expose chain-of-thought by default; ask for step-by-step reasoning in the prompt, and check that a custom stop sequence is not ending the output early.
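  • inconsistent temperature - ollama run has no temperature flag; set the same value for every model with PARAMETER temperature 0.7 in a Modelfile (or /set parameter temperature 0.7 in an interactive session) so that differences come from the models rather than from sampling settings.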
  • slow iteration - Running multiple 7B+ models sequentially is time-consuming; consider scripting the loop.
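
The scripting suggested in the last item, together with a check for the truncated and missing reasoning chains mentioned above, could look like the sketch below. It is an illustration under stated assumptions rather than a tested harness: run_all, think_status, the RUNNER override, and the /tmp/reasoning_runs directory are all hypothetical names, and the status check assumes the raw response contains literal <think> and </think> tags.

```shell
TESTS="${TESTS:-/tmp/reasoning_tests.txt}"
OUT_DIR="${OUT_DIR:-/tmp/reasoning_runs}"      # hypothetical output directory
MODELS="${MODELS:-deepseek-r1:7b qwen2.5:7b}"
RUNNER="${RUNNER:-ollama run}"                 # set RUNNER=echo for a dry run

# Classify a saved response by its thinking block.
think_status() {
  if ! grep -q '<think>' "$1"; then
    echo "none"        # no visible chain-of-thought
  elif grep -q '</think>' "$1"; then
    echo "complete"    # block opened and closed
  else
    echo "truncated"   # opened but never closed: likely cut off
  fi
}

# Run every prompt against every model, one output file per pair,
# and print a tab-separated summary line for each run.
run_all() {
  local n=0 prompt model out
  mkdir -p "$OUT_DIR"
  while IFS= read -r prompt || [ -n "$prompt" ]; do
    [ -z "$prompt" ] && continue
    n=$((n + 1))
    for model in $MODELS; do
      out="$OUT_DIR/$(printf '%s' "$model" | tr ':/' '__')_q${n}.txt"
      # </dev/null stops the runner from reading the rest of the prompt file
      $RUNNER "$model" "$prompt" < /dev/null > "$out" 2>> "$OUT_DIR/errors.log"
      printf '%s\tq%d\t%s\n' "$model" "$n" "$(think_status "$out")"
    done
  done < "$TESTS"
}
```

Call run_all to start the loop. To hold num_ctx and temperature constant, point MODELS at variants created from Modelfiles that set those parameters. A "none" status is expected for models that answer directly, while "truncated" on a reasoning model suggests the context window or output was cut short.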

Related guides

  • How to compare model performance across different quantization levels
  • How to evaluate model response quality using predefined test cases