DeepSeek R1 (671B reasoning)
Open reasoning model that closed the gap with frontier proprietary reasoners. Visible chain-of-thought, MIT license, and a family of distilled smaller variants.
Positioning
DeepSeek R1 is the o1-equivalent open-weight model — explicit reasoning training, visible chain-of-thought, state-of-the-art on math and competitive programming benchmarks. Same MoE architecture as V3, same workstation-class hardware requirement.
Strengths
- Reasoning ceiling matches closed frontier models — true o1-class on hard math and code planning.
- Fully open weights — uniquely valuable in the reasoning space where most leaders are closed.
- Clean MIT license.
Limitations
- Workstation hardware required — same ~380 GB footprint as V3.
- Verbose chain-of-thought consumes lots of tokens.
- Distill versions exist (R1 Distill 70B, 32B, 14B, 7B) — those are the practical local picks.
Real-world performance on RTX 4090
- Direct R1 Q4_K_M (~380 GB) — workstation only, same as V3
- Practical local path: run R1 Distill Llama 70B or R1 Distill Qwen 32B (much more accessible)
Should you run this locally?
Yes, for workstation owners — same hardware story as V3. No, for consumer hardware — pick the R1 Distill variants instead, which deliver most of the reasoning quality at viable hardware costs.
How it compares
- vs DeepSeek V3 → R1 is the reasoning specialist, V3 is the generalist. Different jobs.
- vs DeepSeek R1 Distill Llama 70B → Distill is much more accessible (single 4090 with offload) and captures most of the reasoning lift. Default pick for local hardware.
- vs QwQ 32B → QwQ is the reasoning specialist that fits on a single 4090; R1 has higher ceiling.
- vs OpenAI o1 → R1 is the open-weight equivalent; quality competitive on math/code.
Run this yourself
# For local hardware, prefer the distills:
ollama pull deepseek-r1:70b-llama-distill-q4_K_M
ollama pull deepseek-r1:32b-qwen-distill-q4_K_M
Direct R1 settings: Q4_K_M, multi-GPU, A100/H100 cluster
Why this rating
9.0/10 — DeepSeek's reasoning specialist matches o1-class performance on hard problems and is fully open-weight. Same workstation-size reality as V3. Loses fractional points only on hardware barrier.
Operator notes
DeepSeek R1 is the frontier-tier open-weight reasoning model released in January 2025. Explicit `<think>` reasoning blocks are its core output convention: the model emits its chain of thought before the answer. It beats GPT-4o on math benchmarks and closes the gap with closed-source frontier models on reasoning.
The honest framing for local deployment: R1 itself is not realistically deployable on consumer hardware. The weights run to ~700 GB at native FP8 precision and still ~380 GB at Q4_K_M, which means a multi-GPU server or a multi-machine cluster. The local-AI value of R1 is the distill family: DeepSeek's R1-Distill-Qwen-32B / 14B / 7B / 1.5B, plus R1-Distill-Llama-70B / 8B. The distills capture 60-80% of R1's reasoning at consumer-card-friendly memory footprints.
Deployment notes
Frontier-tier deployment (the actual full R1):
- Multi-node cluster: 2x DGX H100 (8x H100 SXM per node, 16 GPUs total)
- vLLM with --tensor-parallel-size 8 and --pipeline-parallel-size 2, distributed via Ray
- Or: cloud API access via DeepSeek's hosted endpoint (vastly cheaper than self-hosting at most usage tiers)
Local deployment = use the distills:
- DeepSeek R1 Distill Qwen 32B: RTX 4090 24 GB, AWQ-INT4. The /stacks/local-reasoning-model canonical recipe. ~32 tok/s decode + 1500-3000 thinking tokens per query.
- DeepSeek R1 Distill Llama 70B: 2x A100 80 GB or H100. Production-tier reasoning.
- DeepSeek R1 Distill Qwen 14B: 16 GB VRAM tier — sweet spot for budget reasoning.
- DeepSeek R1 Distill Qwen 7B: 8 GB VRAM tier — reasoning at consumer scale.
- DeepSeek R1 Distill Qwen 1.5B: edge / phone tier.
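The decode rate and thinking-token range quoted above for the 32B distill translate into wall-clock time with simple arithmetic. A minimal sketch, assuming the decode rate stays constant for the whole generation and ignoring prompt processing; `thinking_seconds` is a hypothetical helper, not part of any runtime:

```python
def thinking_seconds(thinking_tokens: int, decode_tok_per_s: float) -> float:
    """Seconds spent emitting reasoning tokens at a constant decode rate."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode rate must be positive")
    return thinking_tokens / decode_tok_per_s


# Figures from the list above: ~32 tok/s decode, 1500-3000 thinking tokens.
low = thinking_seconds(1500, 32.0)
high = thinking_seconds(3000, 32.0)
print(f"{low:.0f}-{high:.0f} s of reasoning before the answer starts")
```

With those inputs the estimate lands at roughly 47 to 94 seconds, which is why reasoning distills are a poor fit for interactive chat.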
Runtime compatibility (full R1)
Multi-node only via vLLM + Ray (canonical) or SGLang + Ray (RadixAttention compounds across replicas at cluster scale). The deployment story is the same as /stacks/distributed-inference-homelab for any frontier-MoE.
When to use a different model
- Coding workloads: full R1 is overkill; use Qwen 2.5 Coder 32B for non-reasoning coding, DeepSeek R1 Distill Qwen 32B for reasoning + coding.
- Latency-sensitive workflows: reasoning models add 50-90 seconds wall-clock per query (1500-3000 thinking tokens). For chat or sub-second response, use non-reasoning models.
- Token-cost-sensitive workloads on cloud APIs: the reasoning-token tax at API tier is real money. Use only when reasoning quality justifies the cost.
- Newer release available: DeepSeek V4 launched in 2026 and is the current open-weight benchmark leader.
Best use cases
- Math + scientific computing — verified accuracy on advanced math benchmarks rivals closed-source.
- Multi-step proof construction — explicit reasoning emission is the right paradigm.
- Code synthesis with deep reasoning — when the agent needs to plan multi-file architecture before writing.
- Reasoning-research workloads — the open-weight reasoning baseline for academic research.
Failure modes
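- Reasoning-token context exhaustion. A query that spends 5000 tokens on thinking leaves roughly 27 K of a 32 K context for the prompt and the answer combined, so a long prompt can squeeze the answer. Size max_model_len with the thinking budget in mind if your queries are long.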
- Reasoning blocks leak into structured output. If your client parses output as JSON, the <think> block breaks the parse. Strip thinking tokens or instruct the model to skip reasoning for structured queries.
- Sampler config sensitivity. Reasoning models are more sensitive than chat models: temperature 0.5-0.7 produces meaningfully better reasoning than the chat default of 1.0.
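Two of the failure modes above can be guarded against client-side. The sketch below is one possible approach, not a runtime feature: `strip_thinking`, `parse_json_answer`, and `answer_budget` are hypothetical helpers, and the token counts in the example are made-up inputs.

```python
import json


def strip_thinking(text: str) -> str:
    """Keep only what follows the last </think>; return text unchanged if absent."""
    return text.rsplit("</think>", 1)[-1].strip()


def parse_json_answer(raw_output: str):
    """Parse the post-reasoning part of a completion as JSON."""
    return json.loads(strip_thinking(raw_output))


def answer_budget(context_len: int, prompt_tokens: int, thinking_tokens: int) -> int:
    """Tokens left for the final answer once prompt and reasoning are counted."""
    return max(0, context_len - prompt_tokens - thinking_tokens)


raw = '<think>\nThe user wants a JSON object.\n</think>\n{"answer": 42}'
print(parse_json_answer(raw))
print(answer_budget(32_768, 2_000, 5_000))
```

Splitting on the last `</think>` also covers runtimes that return the completion without the opening tag, for example when the assistant turn was prefilled. In the example, a 2000-token prompt plus 5000 thinking tokens leaves 25768 tokens of a 32768-token window for the answer.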
Going deeper
- /stacks/local-reasoning-model — the local-deployment recipe using the 32B distill
- /systems/distributed-inference — the architecture for multi-node R1 deployment
- DeepSeek V4: the 2026 open-weight benchmark leader
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- MIT license
- Frontier reasoning quality
- Visible CoT
Weaknesses
- 671B is server-only
- Verbose by default
Prompting kit
Tested patterns for getting the most out of DeepSeek R1 (671B reasoning) locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Quirks to know
- DeepSeek explicitly recommends against using a system prompt with R1. Put all instructions, persona, and constraints in the user message instead. The model card states: 'avoid adding a system prompt; all instructions should be contained within the user prompt.'
- R1 emits visible reasoning between <think>...</think> blocks before the final answer. This is by design: let the model generate the block, and hide it client-side by showing only the post-</think> content if you want a clean UX.
- Per the model card, when the model occasionally bypasses thinking (no <think> block), you can force it by prepending '<think>\n' to the assistant turn.
- Avoid few-shot examples: DeepSeek reports that few-shot prompting degrades R1's performance compared to a clear zero-shot instruction.
- For math problems, the model card recommends asking the model to 'reason step by step, and put your final answer within \boxed{}'.
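The quirks above can be folded into a small prompt builder. This is a sketch under assumptions: `build_r1_messages` and `THINK_PREFILL` are hypothetical names, the message format is the common role/content list, and how a prefill is actually applied depends on your runtime.

```python
BOXED_DIRECTIVE = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)
THINK_PREFILL = "<think>\n"  # prepend to the assistant turn to force reasoning


def build_r1_messages(task: str, instructions: str = "", math: bool = False) -> list[dict]:
    """Build a zero-shot message list with no system role for R1."""
    parts = [instructions.strip(), task.strip()]
    if math:
        parts.append(BOXED_DIRECTIVE)
    # Everything goes into a single user message: no system prompt, no few-shot turns.
    user_content = "\n\n".join(p for p in parts if p)
    return [{"role": "user", "content": user_content}]


messages = build_r1_messages(
    "What is 17 * 23?", instructions="Answer concisely.", math=True
)
print(messages)
```

Keeping persona and constraints in the `instructions` argument rather than a system message follows the model card guidance quoted above.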
Chat template
Uses <｜User｜> and <｜Assistant｜> markers written with fullwidth Unicode vertical bars (U+FF5C) rather than ASCII pipes, not standard ChatML. Most runtimes ship the canonical template via tokenizer_config.json; apply that rather than hand-rolling.
Tool calling
R1 was released without official tool-calling support. The model card flags this as a known limitation. For tool use, DeepSeek recommends DeepSeek-V3 (the non-reasoning sibling).
Sampler settings
- temperature: 0.6
Per the model card, recommended sampling temperature is in the 0.5-0.7 range, with 0.6 as the published default. Lower values can cause repetition; higher values can cause incoherent reasoning.
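As an illustration of pinning the sampler, the sketch below builds a request body in the shape Ollama's /api/chat endpoint accepts (`model`, `messages`, `stream`, `options.temperature`). The helper name and the range guard are hypothetical, and the default server address mentioned afterwards is an assumption about your setup.

```python
import json


def ollama_chat_payload(model: str, user_prompt: str, temperature: float = 0.6) -> dict:
    """Build an /api/chat request body with the temperature held in the 0.5-0.7 range."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature outside the recommended 0.5-0.7 range")
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,
        "options": {"temperature": temperature},
    }


payload = ollama_chat_payload("deepseek-r1:671b", "Prove that sqrt(2) is irrational.")
print(json.dumps(payload, indent=2))
```

The payload would be POSTed to a running Ollama server, typically http://localhost:11434/api/chat on a default install. Rejecting out-of-range values is a deliberate choice here; a softer variant could log a warning instead.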
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 380.0 GB | 420 GB |
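To put the VRAM figure in the table in perspective, the following sketch counts how many cards of a given size it takes to reach it. It is a rough lower bound only: it assumes memory pools perfectly across cards and ignores per-card overhead and interconnect limits. `cards_needed` is a hypothetical helper.

```python
import math

Q4_K_M_VRAM_GB = 420  # "VRAM required" value from the table above


def cards_needed(vram_per_card_gb: float, required_gb: float = Q4_K_M_VRAM_GB) -> int:
    """Minimum number of identical cards whose combined VRAM covers the requirement."""
    if vram_per_card_gb <= 0:
        raise ValueError("per-card VRAM must be positive")
    return math.ceil(required_gb / vram_per_card_gb)


for name, vram in [("RTX 4090 (24 GB)", 24), ("A100/H100 (80 GB)", 80)]:
    print(f"{name}: at least {cards_needed(vram)} cards")
```

By this count the Q4_K_M build needs at least 18 RTX 4090s or 6 80 GB datacenter cards, which is why the distills are the realistic local option.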
Get the model
Ollama
One-line install
ollama run deepseek-r1:671b
HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of DeepSeek R1 (671B reasoning).
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run DeepSeek R1 (671B reasoning)?
Can I use DeepSeek R1 (671B reasoning) commercially?
What's the context length of DeepSeek R1 (671B reasoning)?
How do I install DeepSeek R1 (671B reasoning) with Ollama?
Compare against other models
Curated head-to-head decisions where DeepSeek R1 (671B reasoning) is one of the contenders. For arbitrary pairings use /model-battle.
Source: huggingface.co/deepseek-ai/DeepSeek-R1
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.