Capability notes
Reasoning models add a chain-of-thought (CoT) pass before answering: the model "thinks" in tokens. This improves accuracy on math (MATH benchmark: [DeepSeek V4](/models/deepseek-v4) scores 92-94 vs 76-80 for standard 70B models), formal logic, competitive programming, and multi-step planning. The tradeoff: you generate, and pay for, 2-8× the final output length in reasoning tokens.
Three tiers define the landscape. **Distilled reasoning models** (7B-70B, e.g. [DeepSeek R1 Distill Llama 8B](/models/deepseek-r1-distill-llama-8b)) are small models fine-tuned on reasoning traces from a larger teacher. They reason competently on math and code but generalize poorly beyond those domains. **Prompt-based reasoning** ([Llama 3.3 70B](/models/llama-3-3-70b), [Qwen 3 32B](/models/qwen-3-32b)) uses system prompts ("think step by step") to coax CoT. On GSM8K, prompted Llama 3.3 70B hits 90-94 vs DeepSeek R1's 97+, but reliability varies unpredictably across problem types. **Native reasoning models** ([DeepSeek V4](/models/deepseek-v4), [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b)) have reasoning built into the architecture: they auto-detect when to reason and produce more reliable traces. DeepSeek V4 leads open-weight models on AIME mathematics (80-85 pass@1).
**When reasoning helps**: multi-step math proofs, competitive programming (Codeforces Div2 D+), legal analysis with counterfactual logic, debugging requiring root-cause analysis across files, scientific problem-solving. **When reasoning wastes tokens**: simple classification, known-fact lookup, single-step translation, straightforward summarization, creative "list three options" tasks. At API pricing ($1-2/M reasoning tokens), every reasoning query costs 3-10× a non-reasoning equivalent.
GSM8K (grade-school math) is largely saturated — distilled 8B models hit 85+. MATH and AIME meaningfully separate reasoning tiers.
If you just want to try this
Lowest-friction path to a working setup.
Start with [DeepSeek R1 Distill Llama 8B](/models/deepseek-r1-distill-llama-8b) on [Ollama](/tools/ollama). This is the smallest reasoning model that produces coherent CoT — under 5 GB at Q4, runs on an [RTX 3060 12GB](/hardware/rtx-3060-12gb) or CPU-only on a modern laptop. Run `ollama pull deepseek-r1:8b`, ask a math/logic question. The `<think>` block shows the reasoning trace before the answer.
The model solves most single-step math problems and simple logic puzzles. Expect failure on 5+ step reasoning chains — the 8B parameter budget limits depth regardless of the CoT wrapper. When it gets a problem wrong, the reasoning trace reveals where the chain broke (mid-step arithmetic error, lost variable, assumption collapse).
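A minimal sketch of that first interaction, assuming Ollama's default local REST endpoint (`http://localhost:11434`) and the `deepseek-r1:8b` tag pulled above; the question is just an example:

```python
import re
import requests

# Ask the local Ollama server a logic question and split the reasoning trace
# from the final answer. Assumes Ollama is running on its default port with
# deepseek-r1:8b pulled as described above.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:8b",
        "messages": [{
            "role": "user",
            "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 "
                       "more than the ball. How much does the ball cost?",
        }],
        "stream": False,
    },
    timeout=300,
)
content = resp.json()["message"]["content"]

# R1 distills emit the trace inside a <think>...</think> block before the answer.
match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
trace = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

print("--- reasoning trace ---\n", trace[:500])
print("--- answer ---\n", answer)
```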
If you want stronger reasoning without hardware upgrades, [Llama 3.3 70B](/models/llama-3-3-70b) on [Ollama](/tools/ollama) with a system prompt of "Think through this step by step. Show your work before answering." produces reliable CoT on most problems. Requires ~40 GB combined memory at Q4 — achievable on [RTX 4090](/hardware/rtx-4090) with partial offload or [MacBook Pro 16 M4 Max 64GB](/hardware/macbook-pro-16-m4-max).
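For the prompted variant, the only change is the system message; a sketch against the same local endpoint (the `llama3.3:70b` tag is an assumption, check `ollama list` for the exact name on your machine):

```python
import requests

# Prompted CoT: the reasoning comes from the system prompt rather than native
# <think> tags, so trace and answer arrive as one undifferentiated block.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.3:70b",  # assumed tag; verify with `ollama list`
        "messages": [
            {"role": "system", "content": "Think through this step by step. "
                                          "Show your work before answering."},
            {"role": "user", "content": "If 3 painters finish a room in 4 hours, "
                                        "how long do 5 painters take?"},
        ],
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```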
Don't start with full DeepSeek V4 or Qwen 3 235B as a beginner. The multi-GPU hardware requirement (192 GB+ VRAM) is massive, and you won't learn how reasoning models work any faster than you would with the 8B distill.
For production deployment
Operator-grade recommendation.
Production reasoning pipelines require choosing self-hosted vs API reasoning based on query volume, latency needs, and reasoning depth.
**Self-hosted with [vLLM](/tools/vllm)**: Serving reasoning models benefits from vLLM's reasoning token streaming, which lets users see CoT traces as they generate and improves perceived latency. For [DeepSeek V4](/models/deepseek-v4)-class models, plan 4-8× [H100 PCIe](/hardware/nvidia-h100-pcie) or 2-4× [MI300X](/hardware/amd-mi300x). MoE means ~37-50B active parameters per token, but all experts must be loaded across the cluster.
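On the client side, a hedged sketch of consuming that stream through vLLM's OpenAI-compatible API; the endpoint, model name, and the `reasoning_content` delta field all depend on your deployment and on which reasoning parser is enabled at serve time:

```python
from openai import OpenAI

# Stream a reasoning trace from a vLLM OpenAI-compatible server (assumed at
# localhost:8000; model name is whatever was passed to `vllm serve`).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="deepseek-v4",  # placeholder, match your deployment
    messages=[{"role": "user", "content": "Prove that the sum of two odd integers is even."}],
    stream=True,
    max_tokens=4096,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # With a reasoning parser enabled, vLLM exposes the trace as a separate
    # delta field; fall back to plain content if the field isn't present.
    trace_piece = getattr(delta, "reasoning_content", None)
    if trace_piece:
        print(trace_piece, end="", flush=True)    # CoT trace as it generates
    elif delta.content:
        print(delta.content, end="", flush=True)  # final answer tokens
```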
**Reasoning token economics**: Frontier APIs charge 3-10× standard rates for reasoning tokens. At $2/M input and $10/M output, with reasoning tokens billed at the output rate, a single complex problem generating 20K reasoning tokens costs $0.20+. At 100K queries/month, that is $20K+/month. Self-hosting breaks even faster for reasoning workloads than for standard inference because the reasoning premium amplifies API costs.
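A back-of-envelope calculator for those numbers; the rates are the placeholders quoted above, not any specific provider's price sheet:

```python
# Reasoning-token economics with the example rates above: $2/M input,
# $10/M output, reasoning tokens billed at the output rate.
INPUT_PRICE = 2.0 / 1_000_000    # $ per input (prompt) token
OUTPUT_PRICE = 10.0 / 1_000_000  # $ per output token, reasoning included

def query_cost(input_tokens: int, reasoning_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE
            + (reasoning_tokens + output_tokens) * OUTPUT_PRICE)

per_query = query_cost(input_tokens=2_000, reasoning_tokens=20_000, output_tokens=1_000)
monthly = per_query * 100_000  # 100K queries/month
print(f"per query: ${per_query:.2f}, monthly: ${monthly:,.0f}")
# per query: $0.21, monthly: $21,400
```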
**When to use API reasoning**: (1) Bursty workloads with 90% idle hardware. (2) When reasoning quality must be frontier — closed-source APIs (Claude extended thinking, OpenAI o3) lead open-weight on AIME and GPQA by 2-5 points. (3) When you lack the ops team for multi-GPU MoE serving. [Cursor](/tools/cursor) and [Cline](/tools/cline) default to API reasoning for this reason — the maintenance cost of self-hosted frontier reasoning exceeds the API bill for teams under 50 engineers.
**Monitoring reasoning quality**: (a) Trace truncation: the model hits max_tokens mid-thought, producing garbage output; set max_reasoning_tokens separately from max_output_tokens. (b) Reasoning collapse: the CoT reduces to "thinking..." with no actual steps. (c) Infinite loops: common on MoE models where expert routing oscillates. Log reasoning traces for debugging (expect 2-8× output storage cost).
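A sketch of those checks as a post-generation hook; the field names (`trace`, `answer`, `finish_reason`) are illustrative rather than tied to a specific API:

```python
# Hypothetical trace-quality checks for the three failure modes above.
def check_reasoning_quality(trace: str, answer: str, finish_reason: str,
                            max_ratio: float = 10.0) -> list[str]:
    alerts = []

    # (a) Trace truncation: generation stopped on the length limit mid-thought.
    if finish_reason == "length" and not answer:
        alerts.append("truncated: raise max_reasoning_tokens or shrink the prompt")

    # (b) Reasoning collapse: a trace too short to contain actual steps.
    if len(trace.split()) < 20:
        alerts.append("collapse: trace has no substantive steps")

    # (c) Runaway reasoning: trace length far exceeds answer length.
    if answer and len(trace) / max(len(answer), 1) > max_ratio:
        alerts.append(f"ratio: trace exceeds {max_ratio:.0f}x the answer, check for loops")

    return alerts
```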
**Agentic reasoning costs**: Agents that chain multiple reasoning calls (plan → execute → reflect → replan) amplify tokens 3-5× per task. Budget 10K-50K reasoning tokens per agentic task. [Aider](/tools/aider) with DeepSeek V4 uses ~15K reasoning tokens per SWE-bench code change.
What breaks
Failure modes operators see in the wild.
- **Confident wrong reasoning.** The model produces a detailed, step-by-step trace arriving at a wrong conclusion. Every intermediate step looks plausible, but one contains a factual error or logical leap that cascades. Mitigation: run a verifier pass, feeding answer + trace to a second model (or the same model in non-reasoning mode) and asking "is this reasoning valid?" (see the sketch after this list). Catches ~30-50% of confident errors.
- **Infinite reasoning loops.** The model generates reasoning tokens indefinitely, circling the same logic tree without converging. The `<think>` block fills the context window with no final answer. Common on paradoxes and underspecified problems. Mitigation: hard-cap reasoning tokens at 4-8× expected output length. Monitor reasoning-to-output ratio; alert if it exceeds 10:1.
- **Reasoning collapse on ambiguous problems.** The model sees the problem as unsolvable and outputs a one-line trace ("problem is ambiguous, cannot solve"). Indistinguishable from a lazy failure to the user. Mitigation: prompt with "if underspecified, state assumptions and proceed with the most reasonable interpretation."
- **Token budget explosion on long-context reasoning.** A 60K-token document analyzed with multi-step logic generates 20K-40K reasoning tokens. Combined input + reasoning + output exceeds context window mid-generation. Mitigation: chunk input, reason per chunk, synthesize. Map-reduce reasoning — more compute but prevents context overflow.
- **Politeness decay under reasoning pressure.** When token-constrained on hard problems, models drop normative elements (safety disclaimers, polite framing) because the reasoning trace consumes the tokens that would otherwise cover them. Mitigation: reserve 100-200 output tokens for normative framing, separate from the reasoning budget.
- **Tool-use reasoning deadlock.** A reasoning model with tools (search, code execution) reasons itself into a state needing tool output to continue but hasn't called the tool yet because it's "still thinking." The agent times out. Mitigation: enforce a reasoning-token trigger that forces tool calls after N tokens — built into [Cline](/tools/cline) and [Aider](/tools/aider) but not default in raw API calls.
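A minimal sketch of the verifier pass from the first bullet, assuming an OpenAI-compatible endpoint; the URL and model name are placeholders for whatever you serve locally or call via API:

```python
from openai import OpenAI

# Re-feed the trace and answer to a second pass and ask whether the reasoning
# holds. Endpoint and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def verify(question: str, trace: str, answer: str, model: str = "qwen3-32b") -> bool:
    prompt = (
        f"Question:\n{question}\n\nProposed reasoning:\n{trace}\n\n"
        f"Proposed answer:\n{answer}\n\n"
        "Is every step of this reasoning valid, and does it support the answer? "
        "Reply with exactly VALID or INVALID, then one sentence of justification."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic verdicts make monitoring easier
        max_tokens=200,
    )
    return resp.choices[0].message.content.strip().upper().startswith("VALID")
```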
Hardware guidance
**Hobbyist ($500-$1,500)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb). Runs distilled reasoning models (DeepSeek R1 Distill Llama 8B) at Q4-Q5, 40-80 tok/s. 7-8B models fit 12-16 GB comfortably. Cannot run 70B reasoning distills — they need ~40 GB at Q4. CPU-only viable with [llama.cpp](/tools/llama-cpp) for 8B distills on 16+ GB RAM at 10-20 tok/s.
**SMB ($2,000-$4,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). Runs Llama 3.3 70B with reasoning prompts at Q4 (~40 GB combined, so expect partial CPU offload on the 4090 and a smaller offload on the 5090) at 15-25 tok/s. The RTX 5090's 32 GB makes it the single-card sweet spot for 70B Q4 reasoning with 16K context, even though some layers still spill to system RAM. Also runs Qwen 3 32B at Q8 with CoT prompts at 50-80 tok/s.
**Enterprise ($8,000-$25,000)**: 2× [RTX 5090](/hardware/rtx-5090) (64 GB total) or [RTX A6000](/hardware/rtx-a6000) 48 GB. Runs Qwen 3 32B at Q8 with 32K+ context, or 70B at Q8 for maximum reasoning quality per token. [Mac Studio M3 Ultra 192GB](/hardware/mac-studio-m3-ultra) is the only Apple Silicon path to DeepSeek V4 at Q2-Q3 on a single desktop (3-8 tok/s, functional but slow). No CUDA alternative hits this memory ceiling on a single device.
**Frontier ($50,000+)**: 4-8× [H100 PCIe](/hardware/nvidia-h100-pcie) or 2-4× [MI300X](/hardware/amd-mi300x). Required for DeepSeek V4 full MoE at FP8 or Qwen 3 235B at full precision for production serving with reasonable latency. [NVIDIA H200](/hardware/nvidia-h200) 141 GB enables 1-2 card DeepSeek V4 Q4 — minimum viable frontier reasoning for a team of 5-20 developers.
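The sizing figures quoted across these tiers follow a simple rule of thumb: quantized weight size is roughly parameters × bits-per-weight ÷ 8, with KV cache and runtime overhead on top. A rough sketch (the bits-per-weight values are approximations, not exact file sizes):

```python
# Approximate quantized weight footprint: params (billions) * bits / 8 -> GB.
# KV cache, activations, and runtime overhead come on top, so leave headroom,
# especially for long reasoning contexts.
def approx_weights_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

for name, params, bits in [("8B  @ Q4", 8, 4.8), ("70B @ Q4", 70, 4.8), ("32B @ Q8", 32, 8.5)]:
    print(f"{name}: ~{approx_weights_gb(params, bits):.0f} GB of weights")
# 8B @ Q4: ~5 GB, 70B @ Q4: ~42 GB, 32B @ Q8: ~34 GB
```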
Runtime guidance
**If running distilled reasoning (<70B) for individual use** → [Ollama](/tools/ollama). Supports native `<think>` tag rendering, shows CoT traces inline. Works on macOS (Metal), Windows (CUDA), Linux (CUDA/ROCm). Zero configuration.
**If running 70B+ reasoning on multi-GPU server** → [vLLM](/tools/vllm). Tensor parallelism across GPUs for models exceeding one card. Efficient MoE expert routing for DeepSeek V4-class. Configurable reasoning token streaming via server-sent events. Set `--max-model-len` to account for reasoning tokens beyond output. Chunked prefill is essential for long-context reasoning — prevents blocking during 60K-token prefills.
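A hedged sketch of those engine settings via vLLM's offline Python API (the same knobs exist as flags on `vllm serve`); model path, parallelism degree, and context length are placeholders for your deployment:

```python
from vllm import LLM, SamplingParams

# Engine settings discussed above, expressed through vLLM's Python API.
llm = LLM(
    model="your-org/your-reasoning-model",  # placeholder
    tensor_parallel_size=4,                 # split weights across 4 GPUs
    max_model_len=32768,                    # leave room for reasoning tokens beyond the answer
    enable_chunked_prefill=True,            # avoid blocking on long-context prefills
)

params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate(["Prove that sqrt(2) is irrational."], params)
print(outputs[0].outputs[0].text)
```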
**If on Apple Silicon** → [MLX LM](/tools/mlx-lm). Uses Metal directly, yielding 15-30% better throughput than the llama.cpp Metal backend. Handles variable-length reasoning+answer generations with better memory management than llama.cpp's fixed buffers. [LM Studio](/tools/lm-studio) bundles MLX with reasoning trace rendering, the simplest Apple Silicon experience.
**If maximizing throughput on NVIDIA** → [TensorRT-LLM](/tools/tensorrt-llm). Builds optimized engine files per model+GPU. Inflight batching benefits reasoning workloads: requests with different trace lengths share a batch, so short answers aren't stuck queuing behind long reasoning traces. Build cost: 10-30 minutes per model; worth it for sustained serving.
**If building agentic reasoning loops** → [Aider](/tools/aider) with `--reasoning-effort` flag, or [Cline](/tools/cline) with Claude extended thinking. Their plan→execute→reflect loops separate reasoning from action tokens. For open-weight: pair a reasoning model (DeepSeek V4 via vLLM) with a fast tool-calling model (Qwen 3 32B) — reasoning model plans, tool-calling model executes.
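A minimal sketch of that pairing: the reasoning endpoint drafts a numbered plan, the tool-calling endpoint executes each step. Both endpoints and model names are placeholders, and the loop is deliberately simplified (no reflection or replanning):

```python
from openai import OpenAI

# Reasoning model plans, fast tool-calling model executes. Endpoints and model
# names are placeholders for whatever you actually serve.
planner = OpenAI(base_url="http://reasoning-host:8000/v1", api_key="unused")
executor = OpenAI(base_url="http://tools-host:8001/v1", api_key="unused")

def plan(task: str) -> list[str]:
    resp = planner.chat.completions.create(
        model="deepseek-v4",  # placeholder
        messages=[{"role": "user", "content":
                   f"Break this task into numbered, independently executable steps:\n{task}"}],
        max_tokens=2048,
    )
    text = resp.choices[0].message.content
    # Keep only lines that look like numbered steps.
    return [line.strip() for line in text.splitlines()
            if line.strip() and line.strip()[0].isdigit()]

def run_step(step: str) -> str:
    resp = executor.chat.completions.create(
        model="qwen3-32b",  # placeholder
        messages=[{"role": "user", "content": f"Execute this step and report the result:\n{step}"}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

for step in plan("Add input validation to the upload endpoint and write a regression test."):
    print(step, "->", run_step(step)[:120])
```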