DeepSeek R1 Distill Qwen 32B
32B distill — fits on a single 24GB card with reasoning capability. Best price-per-thinking-token combo for prosumers.
Positioning
The "best reasoning model that runs full-GPU on a single 24 GB card." If you can't accept the 70B distill's offload speeds (22–28 tok/s) and want pure-VRAM throughput (70+ tok/s) with serious reasoning training, this is the pick.
Strengths
- 19 GB at Q4_K_M — full GPU on 24 GB, no offload, 70+ tok/s.
- R1-class reasoning training: closes much of the gap to full R1 on math/code relative to base Qwen 2.5 32B.
- MIT license on the distilled weights; the Qwen 2.5 32B base is Apache 2.0, so no MAU caps apply.
Limitations
- Below the 70B Distill on absolute reasoning ceiling.
- Verbose chain-of-thought — same token-cost concern as other reasoning models.
- Generalist quality slightly lags base Qwen 2.5 32B for simple chat.
Real-world performance on RTX 4090
- Q4_K_M (19.4 GB): 68–86 tok/s decode (with chain-of-thought verbosity)
- Q5_K_M (22.9 GB): 56–70 tok/s
- Q8_0 (35 GB): partial offload, 18–24 tok/s
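The full-GPU versus partial-offload split in the list above comes down to simple arithmetic. Here is a minimal sketch of that check, using the file sizes quoted above; the 1 GB reserve for KV cache and runtime buffers is an illustrative assumption, not a measured figure, and the function names are hypothetical.

```python
# Quant sizes in GB, copied from the performance list above.
QUANT_SIZES_GB = {"Q4_K_M": 19.4, "Q5_K_M": 22.9, "Q8_0": 35.0}


def fits_fully_on_gpu(weights_gb: float, vram_gb: float = 24.0,
                      reserve_gb: float = 1.0) -> bool:
    """Return True if the weights fit in VRAM with reserve_gb left over.

    reserve_gb is an assumed allowance for KV cache and buffers.
    """
    return weights_gb + reserve_gb <= vram_gb


def overflow_gb(weights_gb: float, vram_gb: float = 24.0) -> float:
    """GB of weights that would have to live in system RAM (0 if none)."""
    return max(0.0, weights_gb - vram_gb)


if __name__ == "__main__":
    for name, size in QUANT_SIZES_GB.items():
        status = "full GPU" if fits_fully_on_gpu(size) else "partial offload"
        print(f"{name}: {size} GB -> {status}, overflow {overflow_gb(size):.1f} GB")
```

Longer contexts need a larger reserve, so treat a quant that only just fits (Q5_K_M here) as borderline.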
Should you run this locally?
Yes, for 24 GB GPU owners who want strong reasoning at full-GPU speed. Best speed-quality tradeoff in the reasoning space. No, for users who can accept 70B's offload speed — pick R1 Distill Llama 70B for higher reasoning ceiling.
How it compares
- vs DeepSeek R1 Distill Llama 70B → 70B is smarter; this 32B is much faster (full GPU). Pick by speed-vs-quality.
- vs QwQ 32B → similar size, R1 Distill wins on hardest reasoning; QwQ has slightly cleaner everyday traces. R1 Distill is the stronger pick for math/code planning.
- vs Qwen 3 32B with thinking mode → Qwen 3 32B is more flexible (thinking toggle); R1 Distill has more aggressive reasoning training. Coin flip.
Run this yourself
ollama pull deepseek-r1:32b-qwen-distill-q4_K_M
ollama run deepseek-r1:32b-qwen-distill-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090
Why this rating
8.8/10: the right reasoning model for users who want full-GPU execution on a 24 GB card. Distills R1 reasoning into the Qwen 2.5 32B body and fits in 19 GB at Q4, so no partial offload to system RAM is required. Loses fractional points to the 70B distill on absolute reasoning quality.
Featured in these stacks
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3 · Workstation tier · Role: Primary reasoning model · Build a local reasoning-model stack (May 2026)
DeepSeek R1 Distill Qwen 32B is the reasoning model that actually runs at 24GB VRAM via AWQ-INT4. Stronger reasoning quality per parameter than the full DeepSeek R1 (which needs ~700GB and is impossible locally). Distill gives ~80% of R1's reasoning at 5% of the VRAM.
- Stack · L3 · Production tier · Role: 32B reasoning model · Dual RTX 4090 workstation stack: newer-architecture 70B serving without NVLink
32B class on dual-4090 leaves substantial headroom — 64K context fits comfortably. R1 distill brings explicit thinking-mode emission for hard reasoning tasks. The L1.25-enriched workstation reasoning canonical.
Execution notes
Operator notes
DeepSeek R1 Distill Qwen 32B is the canonical local-reasoning-model deployment in May 2026. Captures ~80% of full DeepSeek R1's reasoning quality in 5% of the VRAM. MIT-licensed weights on an Apache 2.0 Qwen 2.5 32B base. The /stacks/local-reasoning-model recipe is built around this configuration.
The honest framing: full DeepSeek R1 isn't realistically deployable locally (~700GB at its native precision, and still far beyond consumer VRAM when quantized). The distill is what makes serious reasoning workloads viable on consumer hardware. It is distilled into a Qwen 2.5 32B base and preserves R1's reasoning-token emission discipline while fitting RTX 4090 / 5090 / M3 Max class hardware.
Deployment notes
The /stacks/local-reasoning-model recipe pairs this model with vLLM on an RTX 4090 24GB with AWQ-INT4 + 32K context + chunked prefill. It decodes at ~32 tok/s, and a typical reasoning query spends its first 1500-3000 tokens on thinking before the answer starts. End-to-end on a typical reasoning query: 50-90 seconds wall-clock, slower than chat models but much higher quality on math / code / analysis.
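The wall-clock range follows almost directly from the decode rate and the length of the thinking block. Here is a back-of-the-envelope sketch using the figures quoted above; it deliberately ignores prefill time and the tokens of the final answer, so treat the result as a rough floor (the function name is hypothetical).

```python
def thinking_seconds(thinking_tokens: int, decode_tok_per_s: float) -> float:
    """Seconds spent emitting the thinking block at a steady decode rate."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode rate must be positive")
    return thinking_tokens / decode_tok_per_s


if __name__ == "__main__":
    rate = 32.0  # tok/s, the decode figure quoted for the vLLM recipe
    for tokens in (1500, 3000):
        print(f"{tokens} thinking tokens -> {thinking_seconds(tokens, rate):.1f} s")
```

At 32 tok/s, 1500 to 3000 thinking tokens take roughly 47 to 94 seconds, which is in line with the 50-90 second figure.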
For 16GB VRAM, drop to DeepSeek R1 Distill Qwen 14B — meaningfully less reasoning depth but viable.
For Apple Silicon, MLX-LM on M3 Max 64GB / M4 Max handles the 32B distill comfortably; expect ~24-28 tok/s decode (vs RTX 4090's ~32).
For multi-user serving, prefer SGLang over vLLM: RadixAttention's prefix-cache wins compound on reasoning workloads where the system prompt is stable across queries.
Runtime compatibility
- vLLM ✓ excellent. AWQ-INT4 + chunked prefill; the production-default path. Use `--enforce-eager` to avoid CUDA graph compilation issues some R1 distill versions trigger.
- SGLang ✓ excellent. Same quant; stronger on multi-user concurrency where the reasoning system prompt is shared across sessions.
- Ollama ✓ good. Q4_K_M GGUF available; loses concurrency benefits but wins on single-user setup time.
- MLX-LM ✓ good. Apple Silicon path with MLX-4bit quant.
- TensorRT-LLM ✗ not recommended. Recompile-per-config friction kills agent-loop iteration speed.
Quantization suitability
AWQ-INT4 fits in 22GB with reasoning-block headroom on a 24GB card. Drop `--gpu-memory-utilization` to 0.85 (from the 0.9 default): KV cache pressure is higher than on non-reasoning 32B models because reasoning queries emit 1500-3000+ tokens of thinking blocks before the answer. Q3-class quants lose ~6% on math benchmarks, which is meaningful enough that we don't recommend them.
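To see why thinking blocks raise KV cache pressure, it helps to estimate the cache cost per token. The sketch below assumes the architecture figures usually quoted for Qwen 2.5 32B (64 layers, 8 key/value heads under grouped-query attention, head dimension 128) and an FP16 KV cache. These are assumptions to check against the model's config.json, not values taken from this page.

```python
def kv_bytes_per_token(layers: int = 64, kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies.

    The leading factor of 2 covers one key and one value vector per layer.
    Default arguments are assumed architecture values (see lead-in).
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_gib(tokens: int, **arch: int) -> float:
    """KV cache size in GiB for a sequence of the given length."""
    return tokens * kv_bytes_per_token(**arch) / 2**30


if __name__ == "__main__":
    print(f"per token: {kv_bytes_per_token() / 1024:.0f} KiB")
    for tokens in (1500, 3000):
        print(f"{tokens} thinking tokens: {kv_gib(tokens):.2f} GiB")
```

Under these assumptions, each in-flight request carries several hundred MiB of cache for its thinking block alone, before any prompt or answer tokens are counted.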
Best use cases
- Single-machine reasoning: math, multi-step analysis, complex code synthesis. The 32B distill hits the sweet spot of "frontier-quality reasoning on consumer hardware."
- Agent planning steps — the Anthropic-style "thinking → planning → execution" pattern pairs naturally; use this model for the planning step and a faster non-reasoning model for the execution loop.
- Math / scientific computing: accuracy on advanced math benchmarks rivals closed-source models.
- Research / academic reasoning: the MIT license is clean for any deployment.
When to use a different model
- Tight latency budgets: reasoning blocks add 50-90 seconds wall-clock per query. For sub-second response, use Qwen 2.5 Coder 32B or Qwen 3 32B (toggle reasoning off for chat).
- 16GB VRAM tier: drop to DeepSeek R1 Distill Qwen 14B.
- 8GB VRAM: drop further to DeepSeek R1 Distill Qwen 7B.
- Edge / phone tier: DeepSeek R1 Distill Qwen 1.5B — surprisingly capable reasoning at the smallest tier.
- Frontier-tier reasoning: cluster-deploy full DeepSeek R1 or use API.
Failure modes specific to this model
- Reasoning blocks leak into structured output. If the model is instructed to emit JSON, the `<think>...</think>` block can break the parse. Strip thinking tokens or instruct the model to skip reasoning for structured queries.
- Sampler config sensitivity. Reasoning models are more sensitive to sampler parameters than chat models. Use temperature 0.5-0.7 (0.6 is the model-card recommendation); the chat default of 1.0 produces meaningfully worse reasoning.
- Premature stopping on EOS during reasoning. Some configs treat `</think>` as a stop token. Verify that the stop-token list excludes reasoning-block delimiters.
- Token-cost runaway. A reasoning chain of 5000 tokens on a 4-step agent loop costs 20K tokens before the agent does anything else. Set per-task token budgets.
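Two of the failure modes above can be handled with small guards on the client side. Here is a minimal sketch, assuming the model wraps its reasoning in `<think>...</think>` and that the JSON answer follows the block; the function names are hypothetical, and the lone-closing-tag branch covers setups where the opening tag is prefilled by the chat template and so never appears in the output.

```python
import json
import re

# Non-greedy match so several thinking blocks are removed independently.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove reasoning blocks, leaving only the final answer text."""
    cleaned = THINK_BLOCK.sub("", text)
    # If only a closing tag remains, keep what follows the last one.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    return cleaned.strip()


def parse_json_answer(raw: str):
    """Parse the JSON answer after discarding any thinking block."""
    return json.loads(strip_thinking(raw))


def loop_thinking_tokens(tokens_per_chain: int, steps: int) -> int:
    """Thinking tokens spent across an agent loop before any other output."""
    return tokens_per_chain * steps


def within_budget(tokens_per_chain: int, steps: int, budget: int) -> bool:
    """Check a planned loop against a per-task token budget."""
    return loop_thinking_tokens(tokens_per_chain, steps) <= budget


if __name__ == "__main__":
    raw = '<think>\nThe user wants one key.\n</think>\n\n{"answer": 42}'
    print(parse_json_answer(raw))
    print(loop_thinking_tokens(5000, 4))
```

The budget helper reproduces the arithmetic in the last bullet: four steps at 5000 thinking tokens each is 20K tokens.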
Going deeper
- /stacks/local-reasoning-model — the canonical deployment recipe
- /systems/agent-execution-systems — when reasoning helps agent workflows
- DeepSeek R1 — the parent / full model
- Qwen 3 32B — the alternative with toggle-style reasoning
- QwQ 32B — the Qwen team's reasoning alternative
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- MIT license
- Single-24GB-card reasoner
Weaknesses
- Verbose CoT inflates output cost
Prompting kit
Tested patterns for getting the most out of DeepSeek R1 Distill Qwen 32B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Quirks to know
- Distilled from DeepSeek-R1 into Qwen2.5 32B. Retains R1's <think>...</think> reasoning behavior per the model card.
- No system prompt: same DeepSeek rule. Put instructions in the user message.
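- Ships its own chat template in tokenizer_config.json. Per the model card, DeepSeek changed the base model's config and tokenizer for the distills, so use the shipped template rather than assuming base Qwen2.5's ChatML markers apply.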
- Per the model card, this is the recommended R1 distill for 24-32GB-VRAM rigs; it nearly matches R1-Distill-Llama-70B at half the VRAM footprint.
- Tool calling is not officially supported (same as the Llama distill). The distillation overwrites the base model's tool-call tuning.
Chat template
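Use the template that ships in tokenizer_config.json (for example through the tokenizer's apply_chat_template) rather than hand-writing base Qwen2.5's <|im_start|>{role}\n{content}<|im_end|> ChatML markers. Per the model card, DeepSeek modified the tokenizer and config for the distills.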
Tool calling
Tool calling unreliable post-distillation per the model card. For tool use on Qwen, fall back to base Qwen2.5 32B Instruct or Qwen 3 32B.
Sampler settings
- temperature: 0.6
- top_p: 0.95
Per the model card, the recommended temperature is in the 0.5-0.7 range. Same as base R1.
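These settings, together with the no-system-prompt rule from the quirks list, can be baked into a request builder. Here is a minimal sketch that produces a payload in the shape accepted by OpenAI-compatible chat endpoints (which vLLM and SGLang expose); the function name, the model string, and the `max_tokens` default are illustrative assumptions.

```python
def build_chat_request(model: str, user_prompt: str,
                       instructions: str = "",
                       max_tokens: int = 4096) -> dict:
    """Build a chat-completions payload with the sampler settings above.

    Instructions are folded into the user message instead of a system
    message. max_tokens is an assumed cap that leaves room for thinking.
    """
    content = f"{instructions}\n\n{user_prompt}" if instructions else user_prompt
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    request = build_chat_request(
        model="local-r1-distill-32b",  # hypothetical served-model name
        user_prompt="How many primes are there below 50?",
        instructions="Reason step by step, then state the final count.",
    )
    print(request)
```

Capping `max_tokens` also acts as a per-request backstop against runaway thinking chains.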
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 19.0 GB | 24 GB |
| Q8_0 | 34.0 GB | 40 GB |
Get the model
Ollama
One-line install
ollama run deepseek-r1:32b
HuggingFace
Original weights
Source repository — direct quantization required.
Source: huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.