deepseek
671B parameters
Commercial OK
Reviewed June 2026

DeepSeek R1 (671B reasoning)

Open reasoning model that closed the gap with frontier proprietary reasoners. Visible chain-of-thought, MIT license, and a family of distilled smaller variants.

License: MIT · Released Jan 20, 2025 · Context: 131,072 tokens

Our verdict

Fredoline Eruo · Verified Jun 12, 2026
9.0/10

Positioning

DeepSeek R1 is the o1-equivalent open-weight model: explicit reasoning training, visible chain-of-thought, and state-of-the-art results on math and competitive-programming benchmarks. It shares V3's MoE architecture (671B total parameters, ~37B active per token) and the same workstation-class hardware requirement.

Strengths

  • Reasoning ceiling matches closed frontier models — true o1-class on hard math and code planning.
  • Fully open weights — uniquely valuable in the reasoning space where most leaders are closed.
  • Clean MIT license.

Limitations

  • Workstation hardware required — same ~380 GB footprint as V3.
  • Verbose chain-of-thought consumes lots of tokens.
  • Distill versions exist (R1 Distill 70B, 32B, 14B, 8B, 7B, 1.5B); those are the practical local picks.

Real-world performance on RTX 4090

  • Direct R1 Q4_K_M (~380 GB) — workstation only, same as V3
  • Practical local path: run R1 Distill Llama 70B or R1 Distill Qwen 32B (much more accessible)

Should you run this locally?

Yes, for workstation owners — same hardware story as V3. No, for consumer hardware — pick the R1 Distill variants instead, which deliver most of the reasoning quality at viable hardware costs.

How it compares

  • vs DeepSeek V3 → R1 is the reasoning specialist, V3 is the generalist. Different jobs.
  • vs DeepSeek R1 Distill Llama 70B → Distill is much more accessible (single 4090 with offload) and captures most of the reasoning lift. Default pick for local hardware.
  • vs QwQ 32B → QwQ is the reasoning specialist that fits on a single 4090; R1 has higher ceiling.
  • vs OpenAI o1 → R1 is the open-weight equivalent; quality competitive on math/code.

Run this yourself

# For local hardware, prefer the distills:
ollama pull deepseek-r1:70b-llama-distill-q4_K_M
ollama pull deepseek-r1:32b-qwen-distill-q4_K_M
Direct R1 settings: Q4_K_M, multi-GPU, A100/H100 cluster

Why this rating

9.0/10: DeepSeek's reasoning specialist matches o1-class performance on hard problems and is fully open-weight. It shares V3's workstation-size hardware reality, and that hardware barrier is the only thing costing it points.



Operator notes

DeepSeek R1 is the frontier-tier open-weight reasoning model released in January 2025. Explicit `<think>` reasoning blocks are its defining output format: the model emits its chain of thought before the answer. It beats GPT-4o on math benchmarks and closes the gap with closed-source frontier models on reasoning.

The honest framing for local deployment: R1 itself is not realistically deployable locally. The native FP8 weights are ~700 GB, and even the Q4_K_M quant is ~380 GB, which puts full R1 on multi-GPU servers or multi-machine clusters. The local-AI value of R1 is the distill family: DeepSeek's R1-Distill-Qwen-32B / 14B / 7B / 1.5B, plus R1-Distill-Llama-70B / 8B. The distills capture 60-80% of R1's reasoning at consumer-card-friendly memory footprints.

Deployment notes

Frontier-tier deployment (the actual full R1):

  • Multi-node cluster: 2x DGX H100 (16x H100 SXM in total)
  • vLLM tensor-parallel-size=8 + pipeline-parallel-size=2 via Ray
  • Or: cloud API access via DeepSeek's hosted endpoint (vastly cheaper than self-hosting at most usage tiers)
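The multi-node layout above, tensor parallelism of 8 within each node and pipeline parallelism of 2 across nodes, maps onto a single `vllm serve` command. This is a minimal sketch, not a tested launch recipe: it assumes a Ray cluster already spans both nodes (`ray start --head` on one, `ray start --address=<head-node-ip>:6379` on the other), uses the Hugging Face repo id `deepseek-ai/DeepSeek-R1`, and picks a 32,768-token `--max-model-len` purely for illustration. The helper name `vllm_serve_argv` is hypothetical.

```python
import shlex
from typing import List

def vllm_serve_argv(model: str = "deepseek-ai/DeepSeek-R1",
                    tp: int = 8, pp: int = 2,
                    max_model_len: int = 32768) -> List[str]:
    """Hypothetical helper: build argv for a tensor- + pipeline-parallel launch.

    Assumes a Ray cluster with tp * pp GPUs is already running; vLLM's Ray
    executor then places the shards across the nodes.
    """
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),      # shard each layer across 8 GPUs
        "--pipeline-parallel-size", str(pp),    # split the layer stack across 2 nodes
        "--distributed-executor-backend", "ray",
        "--max-model-len", str(max_model_len),  # illustrative context cap
    ]

if __name__ == "__main__":
    tp, pp = 8, 2
    argv = vllm_serve_argv(tp=tp, pp=pp)
    # Print rather than execute; run the printed command on the Ray head node.
    print(shlex.join(argv))
    print("GPUs required:", tp * pp)
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without shell-quoting surprises.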

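Local deployment = use the distills (R1-Distill-Qwen-32B / 14B / 7B / 1.5B, R1-Distill-Llama-70B / 8B; see Run this yourself for Ollama tags).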

Runtime compatibility (full R1)

Multi-node only via vLLM + Ray (canonical) or SGLang + Ray (RadixAttention compounds across replicas at cluster scale). The deployment story is the same as /stacks/distributed-inference-homelab for any frontier-MoE.

When to use a different model

  • Coding workloads: full R1 is overkill; use Qwen 2.5 Coder 32B for non-reasoning coding, DeepSeek R1 Distill Qwen 32B for reasoning + coding.
  • Latency-sensitive workflows: reasoning models add 50-90 seconds wall-clock per query (1500-3000 thinking tokens). For chat or sub-second response, use non-reasoning models.
  • Token-cost-sensitive workloads on cloud APIs: the reasoning-token tax at API tier is real money. Use only when reasoning quality justifies the cost.
  • Newer release available: DeepSeek V4 launched March 2026 and is the current open-weight benchmark leader.
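The latency range above is essentially thinking tokens divided by decode throughput. A back-of-envelope sketch; the 30 tokens/s decode rate is an assumed figure for illustration, not a measurement from this review, and prompt prefill time is ignored. The function name is hypothetical.

```python
def reasoning_latency_s(thinking_tokens: int, answer_tokens: int,
                        decode_tok_per_s: float) -> float:
    """Hypothetical estimate: every generated token (thinking or answer) is
    decoded sequentially, so wall-clock ~= total tokens / decode rate.
    Prompt prefill time is ignored."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode rate must be positive")
    return (thinking_tokens + answer_tokens) / decode_tok_per_s

if __name__ == "__main__":
    rate = 30.0  # assumed decode throughput in tokens/s; measure your own
    for thinking in (1500, 3000):
        seconds = reasoning_latency_s(thinking, 0, rate)
        print(f"{thinking} thinking tokens -> ~{seconds:.0f} s")
```

At that assumed rate, 1,500 thinking tokens take about 50 seconds and 3,000 about 100, the same order as the range above; halve the rate and the wait doubles.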

Best use cases

  • Math + scientific computing — verified accuracy on advanced math benchmarks rivals closed-source.
  • Multi-step proof construction — explicit reasoning emission is the right paradigm.
  • Code synthesis with deep reasoning — when the agent needs to plan multi-file architecture before writing.
  • Reasoning-research workloads — the open-weight reasoning baseline for academic research.

Failure modes

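  1. Reasoning-token context exhaustion. Prompt, thinking, and answer share one context window: with a 32K max_model_len, a query that spends 5000 thinking tokens leaves at most ~27K for the prompt and answer combined. If your queries are long, size max_model_len for prompt + thinking + answer, not the answer alone.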
  2. Reasoning blocks leak into structured output. If your client parses output as JSON, the <think> block breaks the parse. Strip thinking tokens or instruct the model to skip reasoning for structured queries.
  3. Sampler config sensitivity. Reasoning models are more sensitive than chat models: temperature 0.5-0.7 (0.6 per the model card) produces meaningfully better reasoning than the common chat default of 1.0.
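Failure modes 1 and 2 can be guarded against in client code. A minimal standard-library sketch: the helper names are hypothetical, not part of any runtime's API, and the 32,768-token window in the usage lines is an assumed max_model_len rather than R1's full 131,072-token limit.

```python
import json
import re

# Hypothetical client-side helpers. The pattern matches one
# <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks and return the final answer.

    If the opening <think> was prefilled into the prompt, the output may hold
    only the closing tag; in that case keep what follows the last </think>.
    """
    text = _THINK_RE.sub("", text)
    if "</think>" in text:
        text = text.rsplit("</think>", 1)[1]
    return text.strip()

def parse_json_answer(text: str):
    """Strip reasoning first, then parse the remainder as JSON."""
    return json.loads(strip_reasoning(text))

def answer_headroom(max_model_len: int, prompt_tokens: int,
                    thinking_tokens: int) -> int:
    """Tokens left for the visible answer once prompt and thinking are spent."""
    return max_model_len - prompt_tokens - thinking_tokens

if __name__ == "__main__":
    raw = '<think>\nThe user wants JSON with a sum...\n</think>\n{"sum": 4}'
    print(parse_json_answer(raw))           # {'sum': 4}
    print(answer_headroom(32768, 0, 5000))  # 27768
```

Stripping in the client rather than suppressing reasoning keeps the model's accuracy intact while giving the JSON parser only the answer.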

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Distilled / fine-tuned from this

  • DeepSeek R1 Distill Qwen 32B / 14B / 7B / 1.5B
  • DeepSeek R1 Distill Llama 70B / 8B

Strengths

  • MIT license
  • Frontier reasoning quality
  • Visible CoT

Weaknesses

  • 671B is server-only
  • Verbose by default

Prompting kit

Source: DeepSeek R1 model card

Tested patterns for getting the most out of DeepSeek R1 (671B reasoning) locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Quirks to know

  • DeepSeek explicitly recommends against using a system prompt with R1. Put all instructions, persona, and constraints in the user message instead. The model card states: 'avoid adding a system prompt; all instructions should be contained within the user prompt.'
  • R1 emits visible reasoning inside <think>...</think> blocks before the final answer. This is by design: let the model generate them, and show only the content after </think> to the end user if you want a clean UX.
  • Per the model card, when the model occasionally bypasses thinking (no <think> block), you can force it by prepending '<think>\n' to the assistant turn.
  • Avoid few-shot examples — DeepSeek's model card observes that few-shot prompting degrades R1's performance compared to a clear zero-shot instruction.
  • For math and code, the model card recommends asking the model to 'reason step by step, and put your final answer within \boxed{}'.
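The prefill and \boxed{} quirks above come down to simple string handling. A minimal sketch, assuming you can reach the rendered prompt string (for example through a runtime's raw-prompt mode) and that the answer follows the model card's \boxed{} convention. The helper names and the abbreviated sample prompt, written with the fullwidth-pipe markers from DeepSeek's published template, are hypothetical; real prompts should come from the canonical chat template.

```python
from typing import Optional

THINK_PREFIX = "<think>\n"  # the prefill the model card suggests

def force_thinking(rendered_prompt: str) -> str:
    """Hypothetical helper: append the <think> prefill (tag plus newline).

    Some copies of the chat template may already emit it, so this is a
    no-op when the rendered prompt already ends with the prefix.
    """
    if rendered_prompt.endswith(THINK_PREFIX):
        return rendered_prompt
    return rendered_prompt + THINK_PREFIX

def extract_boxed(answer: str) -> Optional[str]:
    r"""Hypothetical helper: return the contents of the last \boxed{...}.

    Braces are counted so nested LaTeX such as \boxed{\frac{1}{2}} survives.
    Returns None if there is no box or its braces never close (truncated output).
    """
    start = answer.rfind(r"\boxed{")
    if start == -1:
        return None
    body_start = start + len(r"\boxed{")
    depth = 1
    for j in range(body_start, len(answer)):
        if answer[j] == "{":
            depth += 1
        elif answer[j] == "}":
            depth -= 1
            if depth == 0:
                return answer[body_start:j]
    return None

if __name__ == "__main__":
    # Illustrative, abbreviated prompt; render real prompts with the
    # runtime's canonical chat template instead of by hand.
    prompt = ("<｜User｜>Solve 2x = 1. Please reason step by step, and put your "
              "final answer within \\boxed{}.<｜Assistant｜>")
    print(force_thinking(prompt).endswith("<｜Assistant｜><think>\n"))  # True
    print(extract_boxed(r"... so x = \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```

Taking the last box rather than the first matches how reasoning traces tend to revise intermediate answers before settling on a final one.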

Chat template

DeepSeek (User/Assistant markers)

Uses <｜User｜> and <｜Assistant｜> markers written with fullwidth pipes (｜, U+FF5C), not standard ChatML. Most runtimes ship the canonical template via tokenizer_config.json; apply that rather than hand-rolling.

Tool calling

✗ Not supported

R1 was released without official tool-calling support. The model card flags this as a known limitation. For tool use, DeepSeek recommends DeepSeek-V3 (the non-reasoning sibling).

Sampler settings

temperature: 0.6

Per the model card, recommended sampling temperature is in the 0.5-0.7 range, with 0.6 as the published default. Lower values can cause repetition; higher values can cause incoherent reasoning.
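The sampler advice above, combined with the no-system-prompt quirk, can be folded into one request. A minimal sketch that builds a JSON body for Ollama's /api/chat endpoint; the function name is hypothetical, the 0.5-0.7 guard mirrors the range above, and top_p 0.95 is an assumption taken from the sampling setup DeepSeek reports for its evaluations rather than a setting this page lists.

```python
import json

def r1_chat_request(user_prompt: str, model: str = "deepseek-r1:671b",
                    temperature: float = 0.6, top_p: float = 0.95) -> dict:
    """Hypothetical helper: build a request body for Ollama's /api/chat.

    - No system message: per the model card, instructions go in the user turn.
    - temperature defaults to the published 0.6 and is checked against 0.5-0.7.
    - top_p 0.95 is an assumed value, not one listed on this page.
    """
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("model card recommends temperature in 0.5-0.7")
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p},
    }

if __name__ == "__main__":
    body = r1_chat_request("Prove that the square root of 2 is irrational.")
    # POST this JSON to http://localhost:11434/api/chat with any HTTP client.
    print(json.dumps(body, indent=2))
```

Sending it is a plain HTTP POST to the local Ollama server (default port 11434); the reply may still contain the <think> block, so strip it before display.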


Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         380.0 GB    420 GB
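The file size follows from parameter count times bits per weight. A back-of-envelope sketch, assuming an average of about 4.5 bits per weight for this Q4_K_M build (the rate the 380 GB figure implies, not a published number) and 8 bits for the native FP8 checkpoint; the FP8 estimate lands in the same range as the ~700 GB quoted in the operator notes. The helper name is hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Hypothetical helper: approximate weight size in GB (10^9 bytes),
    computed as parameters * bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

if __name__ == "__main__":
    # ~4.5 bits/weight is an assumed average; K-quants mix block formats,
    # so the effective rate varies from build to build.
    print(f"Q4_K_M: ~{weights_gb(671, 4.5):.0f} GB")  # ~377 GB
    print(f"FP8:    ~{weights_gb(671, 8):.0f} GB")    # ~671 GB
    # The table's 420 GB VRAM figure leaves ~40 GB over the 380 GB file
    # for KV cache and runtime buffers.
```

The same formula explains why no single consumer card comes close: even at 2 bits per weight, 671B parameters still need roughly 168 GB.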

Get the model

Ollama

One-line install

ollama run deepseek-r1:671b

HuggingFace

Original weights

huggingface.co/deepseek-ai/DeepSeek-R1

Source repository — direct quantization required.


Frequently asked

What's the minimum VRAM to run DeepSeek R1 (671B reasoning)?

420 GB of VRAM is enough to run DeepSeek R1 (671B reasoning) at the Q4_K_M quantization (file size 380.0 GB). Higher-quality quantizations need more.

Can I use DeepSeek R1 (671B reasoning) commercially?

Yes. DeepSeek R1 (671B reasoning) ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of DeepSeek R1 (671B reasoning)?

DeepSeek R1 (671B reasoning) supports a context window of 131,072 tokens (128K).

How do I install DeepSeek R1 (671B reasoning) with Ollama?

Run `ollama pull deepseek-r1:671b` to download, then `ollama run deepseek-r1:671b` to start a chat session. The default quantization is Q4_K_M.


Source: huggingface.co/deepseek-ai/DeepSeek-R1

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
