
Reasoning & Math

Multi-step logical reasoning, mathematical problem-solving, and symbolic manipulation. Distinguished from general chat by chain-of-thought trace quality and accuracy on AIME/GSM8K-class benchmarks.

Capability notes

Reasoning models add a chain-of-thought (CoT) pass before answering — the model "thinks" in tokens. This improves accuracy on math (MATH benchmark: [DeepSeek V4](/models/deepseek-v4) scores 92-94 vs 76-80 for standard 70B models), formal logic, competitive programming, and multi-step planning. The tradeoff: reasoning tokens cost 2-8× the final output length.

Three tiers define the landscape:

**Distilled reasoning models** (7B-70B, e.g. [DeepSeek R1 Distill Llama 8B](/models/deepseek-r1-distill-llama-8b)) are small models fine-tuned on reasoning traces from a larger teacher. They reason competently in math and code but generalize poorly outside those domains.

**Prompt-based reasoning** ([Llama 3.3 70B](/models/llama-3-3-70b), [Qwen 3 32B](/models/qwen-3-32b)) uses system prompts ("think step by step") to coax CoT out of a standard chat model. On GSM8K, prompted Llama 3.3 70B hits 90-94 vs DeepSeek R1's 97+. Reliability varies unpredictably across problem types.

**Native reasoning models** ([DeepSeek V4](/models/deepseek-v4), [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b)) have reasoning built into the architecture — they auto-detect when to reason and produce more reliable traces. DeepSeek V4 leads open-weight models on AIME mathematics (80-85 pass@1).

**When reasoning helps**: multi-step math proofs, competitive programming (Codeforces Div2 D+), legal analysis with counterfactual logic, debugging that requires root-cause analysis across files, scientific problem-solving.

**When reasoning wastes tokens**: simple classification, known-fact lookup, single-step translation, straightforward summarization, creative "list three options" tasks. At API pricing ($1-2/M reasoning tokens), every reasoning query costs 3-10× a non-reasoning equivalent.

GSM8K (grade-school math) is largely saturated — distilled 8B models hit 85+. MATH and AIME meaningfully separate the reasoning tiers.
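To see the prompt-based tier concretely, here's a minimal sketch that asks the same question twice through Ollama's `/api/chat` endpoint, once plain and once with a step-by-step system prompt. The model tag and the question are placeholders; substitute whatever you have pulled locally.

```python
# Minimal sketch: prompted chain-of-thought via Ollama's /api/chat endpoint.
import requests

OLLAMA = "http://localhost:11434/api/chat"
QUESTION = ("A bat and a ball cost $1.10 total. The bat costs $1.00 more "
            "than the ball. What does the ball cost?")

def ask(system: str | None) -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": QUESTION})
    resp = requests.post(OLLAMA, json={
        "model": "llama3.3:70b",  # assumption: adjust to your pulled tag
        "messages": messages,
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["message"]["content"]

plain = ask(None)
cot = ask("Think through this step by step. Show your work before answering.")
print("--- plain ---\n", plain)
print("--- prompted CoT ---\n", cot)
```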

If you just want to try this

Lowest-friction path to a working setup.

Start with [DeepSeek R1 Distill Llama 8B](/models/deepseek-r1-distill-llama-8b) on [Ollama](/tools/ollama). This is the smallest reasoning model that produces coherent CoT — under 5 GB at Q4, it runs on an [RTX 3060 12GB](/hardware/rtx-3060-12gb) or CPU-only on a modern laptop. Run `ollama pull deepseek-r1:8b` and ask a math or logic question. The `<think>` block shows the reasoning trace before the answer. The model solves most single-step math problems and simple logic puzzles. Expect failure on reasoning chains of five or more steps — the 8B parameter budget limits depth regardless of the CoT wrapper. When it gets a problem wrong, the reasoning trace reveals where the chain broke (mid-step arithmetic error, lost variable, assumption collapse).

If you want stronger reasoning without a hardware upgrade, [Llama 3.3 70B](/models/llama-3-3-70b) on [Ollama](/tools/ollama), prompted with "Think through this step by step. Show your work before answering.", produces reliable CoT on most problems. It requires ~40 GB of combined memory at Q4 — achievable on an [RTX 4090](/hardware/rtx-4090) with partial offload or a [MacBook Pro 16 M4 Max 64GB](/hardware/macbook-pro-16-m4-max).

Don't start with full DeepSeek V4 or Qwen 3 235B as a beginner. The multi-GPU hardware requirement (192 GB+ VRAM) is massive, and you won't learn how reasoning models work any faster than you would with the 8B distill.
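If you'd rather inspect traces programmatically than in the REPL, here's a sketch that splits the `<think>` block from the final answer. It assumes Ollama's `/api/generate` endpoint and the 8B distill from above.

```python
# Sketch: separate the <think> trace from the final answer in a DeepSeek R1
# distill response, so you can inspect where a chain broke.
import re
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "deepseek-r1:8b",
    "prompt": "If 3x + 7 = 22, what is x?",
    "stream": False,
})
resp.raise_for_status()
text = resp.json()["response"]

m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
trace = m.group(1).strip() if m else ""
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print("reasoning trace:\n", trace)
print("final answer:\n", answer)
```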

For production deployment

Operator-grade recommendation.

Production reasoning pipelines force a choice between self-hosted and API reasoning, driven by query volume, latency needs, and reasoning depth.

**Self-hosted with [vLLM](/tools/vllm)**: Serving reasoning models benefits from vLLM's reasoning-token streaming — users see CoT traces as they generate, which improves perceived latency. For [DeepSeek V4](/models/deepseek-v4)-class models, plan for 4-8× [H100 PCIe](/hardware/nvidia-h100-pcie) or 2-4× [MI300X](/hardware/amd-mi300x). MoE means ~37-50B active parameters per token, but all experts must be loaded across the cluster.

**Reasoning token economics**: Frontier APIs charge 3-10× standard rates for reasoning tokens. At $2/M reasoning input and $10/M output, a single complex problem generating 20K reasoning tokens costs $0.20+. At 100K queries/month, that is $20K+/month. Self-hosting breaks even faster for reasoning workloads than for standard inference because the reasoning premium amplifies API costs.

**When to use API reasoning**: (1) Bursty workloads where hardware would sit 90% idle. (2) When reasoning quality must be frontier — closed-source APIs (Claude extended thinking, OpenAI o3) lead open-weight models on AIME and GPQA by 2-5 points. (3) When you lack the ops team for multi-GPU MoE serving. [Cursor](/tools/cursor) and [Cline](/tools/cline) default to API reasoning for this reason — the maintenance cost of self-hosted frontier reasoning exceeds the API bill for teams under 50 engineers.

**Monitoring reasoning quality**: Watch for (a) trace truncation — the model hits max_tokens mid-thought and produces garbage output; set max_reasoning_tokens separately from max_output_tokens. (b) Reasoning collapse — CoT reduces to "thinking..." with no actual steps. (c) Infinite loops — common on MoE where expert routing oscillates. Log reasoning traces for debugging (2-8× output storage cost).

**Agentic reasoning costs**: Agents that chain multiple reasoning calls (plan → execute → reflect → replan) amplify token usage 3-5× per task. Budget 10K-50K reasoning tokens per agentic task. [Aider](/tools/aider) with DeepSeek V4 uses ~15K reasoning tokens per SWE-bench code change.
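A quick sketch of that break-even arithmetic. Every figure is an assumption lifted from the text, and reasoning tokens are billed at the output rate here (the common API convention, which matches the $0.20+ number above); substitute your own quotes before acting on the output.

```python
# Back-of-envelope sketch of the reasoning-token break-even math above.
# All figures are assumptions -- replace with your real quotes.

API_OUTPUT_PER_M  = 10.00    # $/M output-rate tokens (reasoning billed here)
REASONING_TOKENS  = 20_000   # reasoning tokens per complex query
OUTPUT_TOKENS     = 1_000    # final-answer tokens per query (assumed)
QUERIES_PER_MONTH = 100_000

per_query = (REASONING_TOKENS + OUTPUT_TOKENS) * API_OUTPUT_PER_M / 1_000_000
monthly = per_query * QUERIES_PER_MONTH
print(f"API: ${per_query:.2f}/query -> ${monthly:,.0f}/month")  # $0.21 -> $21,000

# Assumed all-in monthly cost of a self-hosted multi-GPU node
# (hardware amortization, power, ops) -- replace with your real figure.
SELF_HOST_MONTHLY = 15_000
print(f"Self-hosting breaks even above "
      f"{SELF_HOST_MONTHLY / per_query:,.0f} queries/month")
```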

What breaks

Failure modes operators see in the wild.

- **Confident wrong reasoning.** The model produces a detailed, step-by-step trace arriving at a wrong conclusion. Every intermediate step looks plausible, but one contains a factual error or logical leap that cascades. Mitigation: run a verifier pass — feed the answer + trace to a second model (or the same model in non-reasoning mode) asking "is this reasoning valid?" Catches ~30-50% of confident errors; a sketch follows this list.
- **Infinite reasoning loops.** The model generates reasoning tokens indefinitely, circling the same logic tree without converging. The `<think>` block fills the context window with no final answer. Common on paradoxes and underspecified problems. Mitigation: hard-cap reasoning tokens at 4-8× the expected output length. Monitor the reasoning-to-output ratio; alert if it exceeds 10:1.
- **Reasoning collapse on ambiguous problems.** The model judges the problem unsolvable and outputs a one-line trace ("problem is ambiguous, cannot solve"). Indistinguishable from a lazy failure to the user. Mitigation: prompt with "if underspecified, state assumptions and proceed with the most reasonable interpretation."
- **Token budget explosion on long-context reasoning.** A 60K-token document analyzed with multi-step logic generates 20K-40K reasoning tokens. Combined input + reasoning + output exceeds the context window mid-generation. Mitigation: chunk the input, reason per chunk, synthesize. Map-reduce reasoning — more compute, but it prevents context overflow.
- **Politeness decay under reasoning pressure.** When token-constrained on hard problems, models skip normative responses — safety disclaimers, polite framing — because reasoning consumed the social-norm budget. Mitigation: reserve 100-200 output tokens for normative framing, separate from the reasoning budget.
- **Tool-use reasoning deadlock.** A reasoning model with tools (search, code execution) reasons itself into a state where it needs tool output to continue but hasn't called the tool yet because it's "still thinking." The agent times out. Mitigation: enforce a reasoning-token trigger that forces tool calls after N tokens — built into [Cline](/tools/cline) and [Aider](/tools/aider) but not default in raw API calls.
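A minimal verifier-pass sketch for the first failure mode, assuming an Ollama endpoint and a locally pulled verifier model (both names are placeholders; swap in your own). It's a filter, not a guarantee.

```python
# Sketch of the verifier pass described above: replay the trace + answer to a
# second model and ask a yes/no validity question.
import requests

def verify(question: str, trace: str, answer: str,
           model: str = "llama3.3:70b") -> bool:  # verifier model is an assumption
    prompt = (
        f"Question: {question}\n\nProposed reasoning:\n{trace}\n\n"
        f"Proposed answer: {answer}\n\n"
        "Is this reasoning valid and the answer correct? Reply YES or NO "
        "first, then one sentence explaining any flaw."
    )
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model, "prompt": prompt, "stream": False,
    })
    resp.raise_for_status()
    verdict = resp.json()["response"].strip()
    return verdict.upper().startswith("YES")
```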

Hardware guidance

**Hobbyist ($500-$1,500)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb). Runs distilled reasoning models (DeepSeek R1 Distill Llama 8B) at Q4-Q5, 40-80 tok/s. 7-8B models fit 12-16 GB comfortably. Cannot run 70B reasoning distills — they need ~40 GB at Q4. CPU-only is viable with [llama.cpp](/tools/llama-cpp) for 8B distills on 16+ GB RAM at 10-20 tok/s.

**SMB ($2,000-$4,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). Runs Llama 3.3 70B with reasoning prompts at Q4 (~40 GB combined, partial offload on 4090, full-VRAM on 5090) at 15-25 tok/s. The RTX 5090 32GB is the single-card sweet spot for 70B Q4 reasoning with 16K context. Also runs Qwen 3 32B at Q8 with CoT prompts at 50-80 tok/s.

**Enterprise ($8,000-$25,000)**: 2× [RTX 5090](/hardware/rtx-5090) (64 GB total) or an [RTX A6000](/hardware/rtx-a6000) 48 GB. Runs Qwen 3 32B at Q8 with 32K+ context, or 70B at Q8 for maximum reasoning quality per token. The [Mac Studio M3 Ultra 192GB](/hardware/mac-studio-m3-ultra) is the unique Apple Silicon path to DeepSeek V4 Q2-Q3 on a single desktop — 3-8 tok/s, functional but slow. No CUDA alternative hits this memory ceiling on a single device.

**Frontier ($50,000+)**: 4-8× [H100 PCIe](/hardware/nvidia-h100-pcie) or 2-4× [MI300X](/hardware/amd-mi300x). Required for DeepSeek V4 full MoE at FP8 or Qwen 3 235B at full precision for production serving with reasonable latency. The [NVIDIA H200](/hardware/nvidia-h200) 141 GB enables 1-2 card DeepSeek V4 Q4 — the minimum viable frontier reasoning setup for a team of 5-20 developers.
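Where the ~40 GB figure for 70B at Q4 comes from: weights plus KV cache. A rough estimator follows; the coefficients are approximations (≈4.5 bits/weight for a Q4_K_M-style quant, FP16 KV cache), and the Llama-3.3-70B shape (80 layers, 8 GQA KV heads, head dim 128) is the worked example.

```python
# Rough VRAM estimator behind the tier numbers above: quantized weights plus
# KV cache. Approximations, not measurements.

def vram_gb(params_b: float, bits_per_weight: float, ctx_tokens: int,
            n_layers: int, kv_heads: int, head_dim: int,
            kv_bits: int = 16) -> float:
    weights = params_b * 1e9 * bits_per_weight / 8 / 2**30
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes
    kv = 2 * n_layers * kv_heads * head_dim * ctx_tokens * (kv_bits / 8) / 2**30
    return weights + kv

# Llama-3.3-70B-ish shape at ~Q4 with 16K context
print(f"{vram_gb(70, 4.5, 16_384, 80, 8, 128):.1f} GB")  # ~41.7 GB
```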

Runtime guidance

**If running distilled reasoning (<70B) for individual use** → [Ollama](/tools/ollama). Supports native `<think>` tag rendering, shows CoT traces inline. Works on macOS (Metal), Windows (CUDA), Linux (CUDA/ROCm). Zero configuration.

**If running 70B+ reasoning on a multi-GPU server** → [vLLM](/tools/vllm). Tensor parallelism across GPUs for models exceeding one card. Efficient MoE expert routing for DeepSeek V4-class models. Configurable reasoning token streaming via server-sent events. Set `--max-model-len` to account for reasoning tokens beyond output. Chunked prefill is essential for long-context reasoning — it prevents blocking during 60K-token prefills.

**If on Apple Silicon** → [MLX LM](/tools/mlx-lm). Uses Metal directly, yielding 15-30% better throughput than llama.cpp's Metal backend. Better memory management for variable-length reasoning+answer than llama.cpp's fixed buffers. [LM Studio](/tools/lm-studio) bundles MLX with reasoning trace rendering — the simplest Apple Silicon experience.

**If maximizing throughput on NVIDIA** → [TensorRT-LLM](/tools/tensorrt-llm). Builds optimized engine files per model+GPU. Inflight batching benefits reasoning workloads — different trace lengths across concurrent requests amortize queue times. Build cost: 10-30 minutes per model; worth it for sustained serving.

**If building agentic reasoning loops** → [Aider](/tools/aider) with the `--reasoning-effort` flag, or [Cline](/tools/cline) with Claude extended thinking. Their plan→execute→reflect loops separate reasoning from action tokens. For open-weight: pair a reasoning model (DeepSeek V4 via vLLM) with a fast tool-calling model (Qwen 3 32B) — the reasoning model plans, the tool-calling model executes.
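A sketch of those vLLM settings using the offline `LLM` API. The model ID and sizes are assumptions; each argument maps to the equivalent `vllm serve` flag.

```python
# Sketch: vLLM configuration for multi-GPU reasoning serving, per the
# guidance above. Adjust model and sizes to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # assumption: pick yours
    tensor_parallel_size=2,        # split across 2 GPUs
    max_model_len=32_768,          # room for input + reasoning + answer
    enable_chunked_prefill=True,   # avoid blocking on long-context prefills
)

params = SamplingParams(
    max_tokens=24_576,   # hard cap so a runaway trace can't fill the context
    temperature=0.6,
)
out = llm.generate(["Prove that the sum of two odd integers is even."], params)
print(out[0].outputs[0].text)
```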

Setup walkthrough

  1. Install Ollama → `ollama pull deepseek-r1:14b` (~9 GB — distilled reasoning model).
  2. `ollama run deepseek-r1:14b` and type a math problem: "If a train leaves Station A at 60 mph and another leaves Station B at 80 mph 30 minutes later, 200 miles apart, when do they meet?"
  3. The model streams its chain-of-thought in a `<think>` block, then the answer. First response in 5-15 seconds on a 12 GB GPU.
  4. For harder problems (AIME, Olympiad math): `ollama pull deepseek-r1:32b` (~20 GB, requires a 24 GB GPU).
  5. For deep reasoning without distillation loss: `ollama pull qwen-3-235b-a22b` (MoE, ~140 GB — needs multi-GPU, or quantized to ~50 GB).
  6. Evaluate with `pip install lm-eval` (the lm-evaluation-harness package) → `lm_eval --model local-completions --model_args model=deepseek-r1:14b,base_url=http://localhost:11434/v1/completions --tasks gsm8k` (sanity check below).
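Before running step 6, it's worth confirming that the OpenAI-compatible endpoint `lm_eval` will hit is actually serving. A minimal sanity check:

```python
# Quick check that Ollama's OpenAI-compatible /v1 route answers before
# pointing lm_eval at it.
import requests

r = requests.post("http://localhost:11434/v1/completions", json={
    "model": "deepseek-r1:14b",
    "prompt": "Q: What is 17 * 24? A:",
    "max_tokens": 512,
})
r.raise_for_status()
print(r.json()["choices"][0]["text"])
```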

The cheap setup

Used [RTX 3060 12GB](/hardware/rtx-3060-12gb) (~$200-250). Runs DeepSeek R1 Distill Llama 8B at 50-80 tok/s or the Qwen 7B distill at 40-60 tok/s. These distilled models handle GSM8K-level math competently (85-90% accuracy). Pair with a Ryzen 5 5600, 32 GB DDR4, and a 512 GB NVMe drive. Total: ~$400-480. For AIME-level reasoning you need 32B+ models, which require 24 GB of VRAM — this budget can't do AIME-grade reasoning. Be honest: a sub-$500 build gets you competent high-school math, not Olympiad.

The serious setup

Used [RTX 3090](/hardware/rtx-3090) 24 GB ($700-900). Runs DeepSeek R1 Distill Qwen 32B at 15-25 tok/s or DeepSeek V3 (non-reasoning but very capable) at 15-20 tok/s. GSM8K 90%+, AIME 50-70%. Pair with a Ryzen 7 7700X, 64 GB DDR5, and a 2 TB NVMe drive. Total: ~$1,800-2,200. For frontier reasoning: dual RTX 3090 (48 GB total) runs Qwen 3 235B-A22B at IQ4_XS at 5-10 tok/s — AIME 80%+, competitive with closed-source frontier models. The [RTX 5090](/hardware/rtx-5090) 32 GB ($2,000) is the single-GPU reasoning king.

Common beginner mistake

The mistake: using a non-reasoning model (like Llama 3.1 8B) for math problems that require multi-step logic, then blaming "the model is bad at math."

Why it fails: standard chat models generate left-to-right without explicit intermediate reasoning. They'll confidently output wrong answers because they can't backtrack or verify their steps.

The fix: use a chain-of-thought reasoning model (DeepSeek R1 distillations, Qwen 3 with thinking enabled). These models are trained to output reasoning traces before the final answer — they self-correct and verify along the way. Even a 7B reasoning model beats a 32B non-reasoning model on multi-step math. Always check whether your model supports `/think` or a thinking mode.
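A minimal way to see the difference yourself: the same multi-step problem through a standard model and a reasoning distill. Model tags are assumptions; use whatever you have pulled.

```python
# Compare a standard chat model vs a reasoning distill on multi-step math.
import requests

PROBLEM = ("Alice has twice as many apples as Bob. Together they have 18. "
           "Bob gives half his apples to Carol. How many does Bob have left?")

def run(model: str) -> str:
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model, "prompt": PROBLEM, "stream": False,
    })
    r.raise_for_status()
    return r.json()["response"]

print("standard model:\n", run("llama3.1:8b"))     # assumption: your tags
print("reasoning model:\n", run("deepseek-r1:8b"))
```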

Recommended setup for reasoning & math

Recommended hardware
  • NVIDIA H200 →
  • AMD Instinct MI300X →
  • Best GPU for local AI →
Recommended runtimes
  • SGLang →
  • vLLM →
Budget build
  • AI PC under $1,000 →
Best GPU for this task
  • Best GPU for local AI →

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
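The bandwidth claim has a usable rule of thumb behind it: at batch size 1, every generated token streams each active weight byte through the GPU once, so decode speed is roughly memory bandwidth divided by model bytes. A sketch, with an assumed ~60% bandwidth efficiency:

```python
# Rule-of-thumb behind "bandwidth decides decode speed".

def decode_tok_s(bandwidth_gb_s: float, active_params_b: float,
                 bits_per_weight: float, efficiency: float = 0.6) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 * efficiency / bytes_per_token

# RTX 3060 (~360 GB/s) on an 8B model at ~Q4: roughly the tier numbers above
print(f"{decode_tok_s(360, 8, 4.5):.0f} tok/s")  # ~48 tok/s
```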

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running reasoning & math locally. Each links to a diagnose+fix walkthrough.

  • CUDA out of memory →
  • Model keeps crashing →
  • Ollama running slow →
  • llama.cpp too slow →

Before you buy

Verify your specific hardware can handle reasoning & math before committing money.

  • Will it run on my hardware? →
  • Custom compatibility check →
  • GPU recommender (4 questions) →

Featured models

  • Qwen 3 235B-A22B
  • DeepSeek V3 (671B MoE)
  • DeepSeek V4

Featured hardware

  • NVIDIA H200
  • AMD Instinct MI300X

Featured runtimes

  • SGLang
  • vLLM