deepseek

70B parameters

Commercial OK

Reviewed June 2026

DeepSeek R1 Distill Llama 70B

Reasoning distillation onto Llama 3.3 70B. Best-in-class open-weight reasoner you can actually fit on a workstation.

License: MIT·Released Jan 20, 2025·Context: 131,072 tokens

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

9.0/10

Positioning

The most important practical-local model release of 2025. R1 Distill Llama 70B takes R1's reasoning training and applies it to Llama 3.3 70B — you keep the runs-on-a-4090 footprint and gain dramatic reasoning capability. This is the model RTX 3090 / 4090 / 5090 owners should be running for hard problems.

Strengths

Frontier-adjacent reasoning in a 70B-class footprint that runs locally.
Same Llama 3.3 70B VRAM at Q4 — no new hardware needed if you already run Llama 3.3 70B.
Llama license carries through — same permissive commercial terms.

Limitations

Verbose chain-of-thought — 2–3× token cost vs base Llama 3.3 70B.
Generalist quality slightly below base Llama 3.3 70B on simple tasks — pure reasoning training has a small everyday-chat tax.
Same partial-offload speeds as Llama 3.3 70B on 24 GB cards (22–28 tok/s).

Real-world performance on RTX 4090

Q4_K_M (39 GB) — partial offload: 21–27 tok/s decode, but 2–3× tokens per answer
Q5_K_M (47 GB) — heavy offload: 9–13 tok/s
Q8_0 (70 GB) — workstation only

Should you run this locally?

Yes, for anyone who has a single 24 GB card and wants near-frontier reasoning. The most important reasoning-model decision in local AI right now. No, for users on smaller cards (use R1 Distill Qwen 14B instead) or for general chat (base Llama 3.3 70B is faster and equally capable on simple prompts).

How it compares

vs Llama 3.3 70B (base) → R1 Distill wins decisively on reasoning; base wins on simple-task throughput. Run both side by side if disk allows.
vs DeepSeek R1 (full) → full R1 has higher ceiling but requires workstation; Distill 70B captures ~80% of the lift on consumer hardware.
vs QwQ 32B → R1 Distill 70B wins decisively on reasoning quality; QwQ 32B wins on speed (full GPU on 24 GB).
vs DeepSeek R1 Distill Qwen 32B → 70B Llama is smarter; 32B Qwen is faster (full GPU). Pick by speed-vs-quality.

Run this yourself

ollama pull deepseek-r1:70b-distill-llama-q4_K_M
ollama run deepseek-r1:70b-distill-llama-q4_K_M

Settings: Q4_K_M GGUF, 16384 ctx, --n-gpu-layers 65 of 81, RTX 4090 + 64 GB RAM

›Why this rating

9.0/10 — the headline practical-local model of the R1 release. Distills R1's reasoning into the Llama 3.3 70B body — runs in 39 GB Q4, same hardware as Llama 3.3 70B, but with reasoning quality that genuinely approaches frontier levels. The right pick for "I want o1-class reasoning on a single 24 GB card."

Overview

Reasoning distillation onto Llama 3.3 70B. Best-in-class open-weight reasoner you can actually fit on a workstation.

Featured in these stacks

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3·Workstation tier·Role: 70B reasoning model with explicit thinking-mode
Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs
Same hardware envelope as Llama 3.3 70B. R1 distill produces 5-15× more tokens per query (reasoning bloat), so per-stream throughput drops to 8-15 tok/s effective. Reserve for hard reasoning workloads; use Llama 3.3 for general chat.
Stack · L3·Homelab tier·Role: 70B reasoning model with extreme context
Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling
70B at AWQ-INT4 (~40 GB) fits with massive headroom on quad-3090, freeing ~48 GB for KV cache — supports 64K+ context for long-document reasoning workloads.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

DeepSeek R1 (671B reasoning)671B

Frontier

Family siblings (deepseek-r1-distill)

DeepSeek R1 Distill Qwen 1.5B1.5B

Edge

DeepSeek R1 Distill Qwen 7B7B

Consumer

DeepSeek R1 Distill Llama 8B8B

Consumer

DeepSeek R1 Distill Qwen 14B14B

Consumer

DeepSeek R1 Distill Mistral 24B24B

Consumer

DeepSeek R1 Distill Qwen 3 32B32B

Workstation

DeepSeek R1 Distill Qwen 32B32B

Workstation

DeepSeek R1 Distill Llama 70B70B

You are here

Strengths

MIT license
Top reasoning at 70B
Approachable on dual 24GB

Weaknesses

Slower than non-reasoning 70B

Prompting kit

From model card

source

Tested patterns for getting the most out of DeepSeek R1 Distill Llama 70B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Quirks to know

•Distilled from DeepSeek-R1 into Llama 3.3 70B. Per the model card, retains R1's <think>...</think> reasoning behavior — the model emits visible chain-of-thought before the final answer.
•DeepSeek's release notes recommend NO system prompt (same as base R1). Put all instructions into the user message — system prompts degrade the inherited reasoning ability.
•Uses the Llama 3 chat template (NOT DeepSeek's pipe-marker template) because this is a Llama 3.3 fine-tune. Apply via the runtime's tokenizer_config.json.
•Per the model card, the recommended pattern when the model occasionally skips a <think> block is to prepend '<think>\n' to the assistant turn.
•Tool calling is not officially supported in the R1 distills — Llama 3.3's native tool calling was overwritten by the distillation. For tool use, use the base Llama 3.3 70B Instruct.

Chat template

Llama 3

Tool calling

✗ Not supported

Distillation removed Llama 3.3's tool-calling tuning. Per the model card, tool use is unreliable. For tool calling on Llama, use the base Llama 3.3 70B Instruct.

Sampler settings

temperature: 0.6
top_p: 0.95

Per the model card, recommended sampling is in the 0.5-0.7 range. Same defaults as base R1.

Browse prompting kits for every model →/prompting

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	40.0 GB	48 GB
Q5_K_M	47.0 GB	56 GB

Get the model

Ollama

One-line install

ollama run deepseek-r1:70bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of DeepSeek R1 Distill Llama 70B.

NVIDIA B300 (Blackwell Ultra)

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier

Models in the same parameter band as this one

Step up

More capable — bigger memory footprint

Step down

Smaller — faster, runs on weaker hardware

Frequently asked

What's the minimum VRAM to run DeepSeek R1 Distill Llama 70B?

48GB of VRAM is enough to run DeepSeek R1 Distill Llama 70B at the Q4_K_M quantization (file size 40.0 GB). Higher-quality quantizations need more.

Can I use DeepSeek R1 Distill Llama 70B commercially?

Yes — DeepSeek R1 Distill Llama 70B ships under the MIT, which permits commercial use. Always read the license text before deployment.

What's the context length of DeepSeek R1 Distill Llama 70B?

DeepSeek R1 Distill Llama 70B supports a context window of 131,072 tokens (about 131K).

How do I install DeepSeek R1 Distill Llama 70B with Ollama?

Run `ollama pull deepseek-r1:70b` to download, then `ollama run deepseek-r1:70b` to start a chat session. The default quantization is Q4_K_M.

Compare against other models

Curated head-to-head decisions where DeepSeek R1 Distill Llama 70B is one of the contenders. For arbitrary pairings use /model-battle.

DeepSeek R1 Distill Llama 70B vs Llama 3.3 70B

reasoning vs instruction following

Source: huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware

Buyer guides

When it doesn't work

Recommended hardware

Alternatives

DeepSeek R1 Distill Qwen 7B DeepSeek R1 Distill Qwen 14B DeepSeek R1 Distill Qwen 3 32B DeepSeek R1 Distill Qwen 1.5B DeepSeek R1 Distill Llama 8B DeepSeek R1 Distill Qwen 32B DeepSeek R1 Distill Mistral 24B

Before you buy

Verify DeepSeek R1 Distill Llama 70B runs on your specific hardware before committing money.

Will it run on my hardware? →Custom hardware comparison →GPU recommender (4 questions) →