Multi-step scientific reasoning across physics, chemistry, and biology. GPQA and ScienceQA are the standard benchmarks for this workload. Frontier reasoning models lead them.
Pull a reasoning model: ollama pull deepseek-r1:32b (20 GB) or ollama pull qwen3:30b (18 GB; the 30B-A3B MoE, strong reasoning).
Run it on a test problem: ollama run deepseek-r1:32b, then enter "A 2 kg block slides down a frictionless 30° incline. Calculate the acceleration and the time to slide 5 meters. Show your work step by step."
Benchmark it: pip install lm-eval (the lm-evaluation-harness package), then test on GPQA, MMLU-Pro, and ARC-Challenge to compare your local model against published results.

Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs DeepSeek R1 Distill Llama 8B at 50-80 tok/s or DeepSeek R1 Distill Qwen 7B at 40-60 tok/s. These handle high-school to intro-college physics, chemistry, and biology problems competently (GPQA ~30-40%). For undergraduate-level scientific reasoning, the 14B distill (DeepSeek R1 Distill Qwen 14B) runs at 25-35 tok/s with noticeably better multi-step reasoning. Pair with Ryzen 5 5600 + 32 GB DDR4 + 512 GB NVMe. Total: ~$400-480. $400 gets you competent undergrad science reasoning; graduate-level problems require 32B+ models.
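The incline problem in the test prompt above has a closed-form answer, so it works as a quick sanity check on whatever the model returns. The sketch below is a minimal reference calculation, assuming g = 9.81 m/s² and a block that starts from rest; the function names are illustrative and not part of any library. Note that the 2 kg mass cancels out on a frictionless incline, which is a detail a weak model often gets wrong.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def incline_acceleration(angle_deg: float, g: float = G) -> float:
    """Acceleration along a frictionless incline: a = g * sin(theta)."""
    return g * math.sin(math.radians(angle_deg))


def slide_time(distance_m: float, accel: float) -> float:
    """Time to cover a distance from rest: d = a * t^2 / 2, so t = sqrt(2d / a)."""
    return math.sqrt(2 * distance_m / accel)


a = incline_acceleration(30.0)
t = slide_time(5.0, a)
print(f"a = {a:.3f} m/s^2, t = {t:.3f} s")
```

A correct model response should land at roughly 4.9 m/s² and about 1.43 s. If the model's answer depends on the mass, its reasoning trace went wrong somewhere.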
Used RTX 3090 24 GB ($700-900, see /hardware/rtx-3090). Runs DeepSeek R1 Distill Qwen 32B at 15-25 tok/s and handles graduate-level physics and chemistry problems (GPQA ~50-65%). For research-grade scientific reasoning: Qwen 3 235B MoE at IQ4_XS (50 GB) on dual RTX 3090s (48 GB total, ~$1,600) at 5-10 tok/s, with GPQA 70%+ and near-frontier quality. Note that 50 GB of weights does not fit entirely in 48 GB of VRAM, so some layers have to be offloaded to system RAM. Total: ~$1,800-2,500. Scientific reasoning benefits disproportionately from model scale: the jump from 7B to 32B to 235B is qualitative, not just quantitative. Each step unlocks a new tier of scientific problems.
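Before buying for a given tier, it helps to estimate the weight footprint of a model and quant yourself. The sketch below uses the rough rule that weight size is parameter count times bits per weight divided by 8. The 5 bits per weight used in the example is an assumed ballpark for 4-bit quants including their scale data, not a figure from this guide, and the function name is illustrative.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead, so treat the
    result as a lower bound on the memory actually needed.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# A 32B model at an assumed ~5 bits per weight.
size_32b = weights_gb(32, 5.0)
print(f"32B at ~5 bpw: about {size_32b:.0f} GB of weights")
```

Under that assumption a 32B model comes out near the 20 GB download quoted above, which leaves only a few GB of a 24 GB card for context. The same arithmetic can be run for any larger model before committing to a multi-GPU build.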
The mistake: Using a non-reasoning chat model for scientific problem-solving, getting a confidently wrong answer, and citing it in a paper or homework.

Why it fails: Standard LLMs don't do step-by-step verification. Asked "What's the pH of 0.1 M HCl?" a chat model might say "pH = 1" (correct), "pH = 0.1" (confusing concentration with pH), or "pH = 13" (confusing acid with base), all with equal confidence. Without a reasoning trace, you can't tell which answers were reasoned and which were hallucinated.

The fix: Use a model with explicit chain-of-thought reasoning (a DeepSeek R1 distill, or Qwen 3 with thinking mode). These models output their reasoning before the answer. Read the reasoning: if the logic is garbage, the answer is garbage. Also verify calculations independently (Wolfram Alpha, Python). The model is a reasoning partner, not a calculator; it makes arithmetic errors even when the logic is correct. Trust the reasoning trace, verify the numbers.
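As a sketch of "trust the reasoning trace, verify the numbers", the snippet below separates a model's reasoning from its final answer and checks the pH example independently. It assumes the reasoning arrives wrapped in <think>...</think> tags, as R1-style models typically emit; the sample output string is made up for illustration, and the helper names are hypothetical.

```python
import math
import re


def split_reasoning(raw: str) -> tuple[str, str]:
    """Split raw model output into (reasoning_trace, final_answer).

    Assumes the trace is wrapped in <think>...</think>. If no such block
    is found, the whole output is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    return match.group(1).strip(), raw[match.end():].strip()


def strong_acid_ph(molarity: float) -> float:
    """pH of a fully dissociated monoprotic strong acid: pH = -log10([H+])."""
    return -math.log10(molarity)


# Hypothetical model output, invented for this example.
sample = (
    "<think>HCl is a strong acid, so [H+] = 0.1 M. "
    "pH = -log10(0.1) = 1.</think>\npH = 1"
)

trace, answer = split_reasoning(sample)
claimed = float(re.search(r"-?\d+(?:\.\d+)?", answer).group())
expected = strong_acid_ph(0.1)
agrees = math.isclose(claimed, expected, abs_tol=0.01)
print(f"claimed={claimed}, expected={expected:.2f}, agrees={agrees}")
```

An empty trace is itself a warning sign: it means the answer came back without any visible reasoning to audit. The numeric check only covers this one formula, so each problem type needs its own independent calculation.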
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
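The VRAM and bandwidth rules above can be turned into a back-of-the-envelope check. The sketch below assumes a dense model whose weights are read once per generated token, so memory bandwidth divided by weight size gives a ceiling on decode speed. The 936 GB/s figure is the published memory bandwidth of the RTX 3090; the 2 GB overhead allowance for KV cache and runtime buffers is an assumption, and the function names are illustrative.

```python
def fits_in_vram(vram_gb: float, weights_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed overhead allowance fit in VRAM."""
    return weights_gb + overhead_gb <= vram_gb


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed for a dense model.

    Each generated token requires streaming all weights from memory once,
    so tokens per second cannot exceed bandwidth / weight size.
    """
    return bandwidth_gb_s / weights_gb


# 20 GB of weights on a 24 GB RTX 3090 (936 GB/s).
fits = fits_in_vram(vram_gb=24, weights_gb=20)
ceiling = decode_ceiling_tok_s(bandwidth_gb_s=936, weights_gb=20)
print(f"fits={fits}, decode ceiling is about {ceiling:.1f} tok/s")
```

Real throughput lands well below this ceiling because of kernel overhead, KV cache reads, and long reasoning traces. MoE models read only their active experts per token, so this dense-model bound does not apply to them directly.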
The errors most operators hit when running scientific reasoning locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle scientific reasoning before committing money.