DeepSeek R1 Distill Qwen 3 32B
Newer R1 distill on a Qwen 3 base. Combines R1 reasoning with Qwen 3's reasoning-toggle architecture. Apache 2.0.
Positioning
DeepSeek R1 Distill Qwen 3 32B is a dense 32B-parameter model released by DeepSeek AI under the permissive Apache 2.0 license. It combines the reasoning capabilities of the DeepSeek R1 distillation approach with Qwen 3's reasoning-toggle architecture, allowing operators to switch between standard and reasoning modes. With a 131K token context window, it is designed for workstation-class deployment, targeting users who need strong reasoning performance without requiring datacenter hardware.
Strengths
- Permissive Apache 2.0 license: Unlike many open-weight models with restrictive licenses, Apache 2.0 allows commercial use, modification, and redistribution with minimal conditions.
- Long 131K context window: The 131K token context enables processing of large documents, codebases, or multi-turn conversations without truncation.
- Reasoning-toggle architecture: Inherited from Qwen 3, this feature lets operators dynamically enable or disable chain-of-thought reasoning, offering flexibility between latency and depth.
- Workstation-friendly quant sizes: At Q4_K_M (18 GB) or Q3_K_M (15.6 GB), the model fits comfortably on a single 24 GB GPU, with room for KV cache overhead.
Limitations
- No community-verified benchmarks available: Published claims from the vendor should be treated as best-case; independent measurements are not yet available for this specific distill.
- Dense 32B architecture: Unlike Mixture-of-Experts models, all 32B parameters are active per forward pass, meaning inference cost scales with full parameter count.
- KV cache memory overhead: At 131K context, the KV cache can add significant memory pressure (30-50% over model weights), potentially requiring more aggressive (lower-bit) quants or shorter contexts on limited hardware.
- Distillation trade-offs: As a distill of R1, it may inherit some reasoning strengths but could also exhibit reduced performance on tasks requiring broad world knowledge compared to the full R1 model.
What it takes to run this locally
File sizes range from 64 GB (FP16, unquantized) down to ~10.4 GB (Q2_K). For practical deployment, Q4_K_M (~18 GB) or Q3_K_M (~15.6 GB) are recommended for workstation-class hardware (single 24 GB GPU). Add 30-50% for KV cache and framework overhead, especially at long contexts. The model is too large for consumer GPUs (12-16 GB) except at aggressive quants (Q2_K) with limited context. Datacenter deployment is possible but unnecessary given the workstation-friendly size.
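The sizing rule above can be sketched as a quick check. This is a rough illustration only: it takes the file sizes quoted above at face value, applies the 30-50% overhead range to each, and uses hypothetical helper names (memory_range_gb, fit_report) and labels ("fits", "tight", "too large") that are not part of any tool.

```python
# File sizes (GB) as quoted in the text; the 30-50% overhead is the same rule of thumb.
QUANT_SIZES_GB = {"FP16": 64.0, "Q4_K_M": 18.0, "Q3_K_M": 15.6, "Q2_K": 10.4}


def memory_range_gb(weights_gb, low=0.30, high=0.50):
    """Return (best_case, worst_case) total memory after adding overhead."""
    return weights_gb * (1 + low), weights_gb * (1 + high)


def fit_report(vram_gb):
    """Label each quant 'fits', 'tight', or 'too large' for a given VRAM size.

    'fits'      : even the 50% overhead case stays within VRAM
    'tight'     : only the 30% overhead case stays within VRAM
    'too large' : neither case fits
    """
    report = {}
    for name, size in QUANT_SIZES_GB.items():
        best, worst = memory_range_gb(size)
        if worst <= vram_gb:
            report[name] = "fits"
        elif best <= vram_gb:
            report[name] = "tight"
        else:
            report[name] = "too large"
    return report


if __name__ == "__main__":
    for name, verdict in fit_report(24).items():
        print(f"{name}: {verdict}")
```

Under these assumptions a 24 GB card lands Q4_K_M in the "tight" band, which is why context length has to be capped at that quant.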
Should you run this locally?
Yes if you need a permissively licensed reasoning model that fits on a single workstation GPU, and you value the ability to toggle reasoning on/off per task.
No if you require community-verified performance data before committing, or if your hardware is limited to consumer GPUs with less than 20 GB VRAM (unless you accept Q2_K quantization and short contexts).
Catalog cross-links
- DeepSeek R1 Distill Qwen 3 32B
- DeepSeek AI
- Qwen 3 family
How to run it
DeepSeek R1 Distill Qwen 3 32B is DeepSeek's reasoning-distilled model based on Qwen 3 32B. It was trained on DeepSeek-R1's chain-of-thought reasoning traces, which gives it strong step-by-step reasoning abilities.
Run it at Q4_K_M via Ollama (ollama pull deepseek-r1:32b) or llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is ~18 GB on disk. The model uses the standard Qwen 3 architecture, so runtime support is broad.
- Minimum VRAM: 16 GB. An RTX 4080 (16GB) runs Q4_K_M with KV offload at 4K context.
- Recommended: RTX 4090 24GB at Q4_K_M, which handles 8-16K context comfortably.
- Throughput: ~35-55 tok/s on RTX 4090 at Q4_K_M.
- Context: Qwen 3's 128K (131,072 tokens, the "131K" quoted elsewhere on this page); the practical limit is 8-16K on 24 GB.
The key characteristic is that outputs include <think> chain-of-thought blocks before the final answer. This makes generation ~2-4× longer than standard Qwen 3 32B for the same prompt, so budget max_tokens accordingly.
- Use for: complex reasoning, math, logic puzzles, multi-step problem solving, code debugging.
- Not ideal for: quick factual lookups and simple classification, where the CoT overhead isn't worth it.
For the full R1 model (685B MoE), see cloud options.
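As a rough illustration of budgeting max_tokens for the ~2-4× longer output, the sketch below scales an expected non-reasoning answer length by the upper multiplier and rounds up. The function name suggest_max_tokens and the rounding step of 256 are assumptions made for this example, not settings from any runtime.

```python
import math


def suggest_max_tokens(plain_answer_tokens, multiplier=4, step=256):
    """Suggest a max_tokens value for a reasoning model.

    plain_answer_tokens: tokens a non-reasoning model would need for the answer
    multiplier:          upper end of the 2-4x range quoted in the text
    step:                round up to a multiple of this (illustrative choice)
    """
    raw = plain_answer_tokens * multiplier
    return math.ceil(raw / step) * step


if __name__ == "__main__":
    for n in (250, 500, 1024):
        print(n, "->", suggest_max_tokens(n))
```

Using the upper multiplier is the conservative choice: a limit set too low cuts the output off inside the reasoning block, before any answer is produced.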
Hardware guidance
- Minimum: RTX 3060 12GB at Q3_K_M with KV offload.
- Recommended: RTX 4090 24GB at Q4_K_M (16K context).
- Optimal: RTX 5090 32GB at Q4_K_M (32K context, no offload).
VRAM math: the model is 32B dense, so Q4_K_M weights are ≈ 18 GB. KV cache at 16K context adds ~8 GB, for a total of ~26 GB at 16K.
- RTX 4090 24GB: Q4 + 8K context = ~22 GB, which fits on-GPU. At 16K (~26 GB), offload the KV cache.
- RTX 3090 24GB: same profile as the RTX 4090.
- RTX 4080 16GB: Q4 + 2K context on-GPU.
- MacBook Pro M4 Pro 24GB+: Q4 at 10-20 tok/s.
- Cloud: A10 24GB at Q4_K_M.
Budget for 2-4× output tokens vs non-reasoning models; the CoT blocks add substantial generation cost. AWQ-INT4 drops weights to ~16 GB.
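The VRAM math above can be reproduced with a small calculator. This sketch assumes the KV cache grows linearly with context length from the quoted figure of ~8 GB at 16K, and it ignores framework overhead; both are simplifications, and the function names are hypothetical.

```python
WEIGHTS_GB = 18.0          # Q4_K_M weights, as quoted in the text
KV_GB_AT_16K = 8.0         # KV cache at 16K context, as quoted in the text
REF_CONTEXT = 16 * 1024    # 16K tokens


def total_vram_gb(context_tokens):
    """Weights plus KV cache, assuming KV size scales linearly with context."""
    kv_gb = KV_GB_AT_16K * context_tokens / REF_CONTEXT
    return WEIGHTS_GB + kv_gb


def max_context_on_gpu(vram_gb):
    """Largest context (in tokens) that keeps weights and KV cache fully on-GPU."""
    free_gb = vram_gb - WEIGHTS_GB
    if free_gb <= 0:
        return 0  # the weights alone do not fit; layers must be offloaded
    return int(free_gb / KV_GB_AT_16K * REF_CONTEXT)


if __name__ == "__main__":
    print(total_vram_gb(8 * 1024))    # 8K context
    print(total_vram_gb(16 * 1024))   # 16K context
    print(max_context_on_gpu(24))     # 24 GB card
```

With these inputs the 8K and 16K totals come out to 22 GB and 26 GB, matching the figures above, and a 24 GB card tops out around 12K tokens before the KV cache has to be offloaded.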
What breaks first
- CoT token explosion. The model generates 200-1000+ extra tokens of chain-of-thought before every answer. This increases cost 2-4× and adds latency. Turn off CoT with specific prompting if you don't need it.
- <think> tag parsing. Outputs contain <think>...</think> blocks. Your parser must handle these: either strip them or display them separately. Failing to parse produces garbled user-facing output.
- Simple-task overkill. For simple questions ("what is 2+2"), the model may still generate verbose CoT. Configure max_tokens and stop sequences appropriately for your use case.
- Q3 reasoning degradation. Reasoning chains at Q3 degrade more than factual answers: intermediate steps may contain logical errors that compound into wrong final answers. Use Q4_K_M minimum for reasoning tasks.
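For the tag-parsing failure mode, a minimal sketch of stripping the reasoning blocks before display is shown below. It assumes the blocks are well-formed, with every <think> closed by a matching </think>. The helper name strip_think is illustrative.

```python
import re

# Non-greedy match so several separate <think> blocks are each removed;
# DOTALL lets "." span the newlines inside a block.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text):
    """Remove every complete <think>...</think> block and trim whitespace."""
    return THINK_BLOCK.sub("", text).strip()


if __name__ == "__main__":
    raw = "<think>\n2 + 2 is basic addition.\n</think>\n\nThe answer is 4."
    print(strip_think(raw))
```

An unclosed <think> (for example, output cut off by max_tokens) is left untouched by this pattern, so truncated responses need separate handling.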
Runtime recommendation
Ollama for quick-start (DeepSeek R1 distill tags are commonly available). llama.cpp for production — precise control over CoT tokens, stop sequences. vLLM for serving. Qwen 3 architecture — any Qwen-compatible stack works. Set temperature=0.6-0.8 for diverse reasoning; temp=0 for deterministic math.
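As one way to apply the temperature guidance, the sketch below builds a request for Ollama's local /api/generate endpoint. The preset names and the midpoint value 0.7 are choices made for this example, the model tag is the one quoted on this page and should be checked against your local library, and generate() needs a running Ollama server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "deepseek-r1:32b"  # tag quoted on this page; verify it locally

# Illustrative presets: 0.7 sits inside the 0.6-0.8 range from the text.
TEMPERATURE_PRESETS = {"diverse": 0.7, "deterministic": 0.0}


def build_request(prompt, mode="diverse", max_tokens=4096):
    """Build the JSON payload for a single non-streaming generation."""
    if mode not in TEMPERATURE_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": TEMPERATURE_PRESETS[mode],
            "num_predict": max_tokens,  # Ollama's name for the output token cap
        },
    }


def generate(prompt, mode="diverse", max_tokens=4096):
    """Send the request to a local Ollama server and return the generated text."""
    body = json.dumps(build_request(prompt, mode, max_tokens)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(json.dumps(build_request("What is 17 * 23?", mode="deterministic"), indent=2))
```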
Common beginner mistakes
- Mistake: Setting max_tokens=1024 and expecting complete reasoning chains. Fix: CoT adds 200-1000+ tokens before the answer. Set max_tokens=4096+ for complex reasoning tasks.
- Mistake: Displaying <think> blocks to end users. Fix: Parse and strip <think>...</think> blocks before displaying the final answer, or fold them behind a "show reasoning" toggle.
- Mistake: Using R1-distill for simple classification tasks. Fix: The CoT overhead isn't worth it for binary/simple tasks. Use standard Qwen 3 32B instead.
- Mistake: Comparing R1-distill 32B to full R1 685B. Fix: Distillation transfers reasoning patterns but not the full model's capacity. The 32B distill is strong for its size but doesn't match the full 685B MoE.
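The first two mistakes interact: when max_tokens is too low, the output can end inside the reasoning block, leaving nothing to show the user. A minimal sketch of separating reasoning from answer and flagging that case follows. It assumes at most one <think> block per response, and split_reasoning is a hypothetical helper name.

```python
OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def split_reasoning(text):
    """Return (reasoning, answer, truncated) for one model response.

    truncated is True when <think> was opened but never closed, which
    usually means generation hit the token limit mid-reasoning.
    """
    start = text.find(OPEN_TAG)
    if start == -1:
        return "", text.strip(), False  # no reasoning block at all

    body_start = start + len(OPEN_TAG)
    end = text.find(CLOSE_TAG, body_start)
    if end == -1:
        return text[body_start:].strip(), "", True  # cut off mid-reasoning

    reasoning = text[body_start:end].strip()
    answer = (text[:start] + text[end + len(CLOSE_TAG):]).strip()
    return reasoning, answer, False


if __name__ == "__main__":
    reasoning, answer, truncated = split_reasoning(
        "<think>Check both branches.</think>Use the second branch."
    )
    print("answer:", answer)
    print("reasoning (for a 'show reasoning' toggle):", reasoning)
    print("truncated:", truncated)
```

When truncated comes back True, retrying with a larger max_tokens is usually more useful than showing the partial reasoning.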
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- R1 reasoning + Qwen 3 base
- Apache 2.0
Weaknesses
- Ecosystem is newer and less mature than that of the original Qwen 2.5 distill
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| AWQ-INT4 | 19.0 GB | 22 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
What's the minimum VRAM to run DeepSeek R1 Distill Qwen 3 32B?
Can I use DeepSeek R1 Distill Qwen 3 32B commercially?
What's the context length of DeepSeek R1 Distill Qwen 3 32B?
Source: huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen3-32B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.