deepseek
32B parameters
Commercial OK
Reviewed June 2026

DeepSeek R1 Distill Qwen 32B

32B distill — fits on a single 24GB card with reasoning capability. Best price-per-thinking-token combo for prosumers.

License: MIT·Released Jan 20, 2025·Context: 131,072 tokens

Our verdict

Fredoline Eruo · Verified Jun 12, 2026
8.8/10

Positioning

The "best reasoning model that runs full-GPU on a single 24 GB card." If you can't accept the 70B distill's offload speeds (22–28 tok/s) and want pure-VRAM throughput (70+ tok/s) with serious reasoning training, this is the pick.

Strengths

  • 19 GB at Q4_K_M — full GPU on 24 GB, no offload, 70+ tok/s.
  • R1-class reasoning training: a clear step up over base Qwen 2.5 32B on math/code.
  • MIT license: commercial use permitted, with no MAU caps. The Qwen 2.5 32B base is Apache 2.0.

Limitations

  • Below the 70B Distill on absolute reasoning ceiling.
  • Verbose chain-of-thought — same token-cost concern as other reasoning models.
  • Generalist quality slightly lags base Qwen 2.5 32B for simple chat.

Real-world performance on RTX 4090

  • Q4_K_M (19.4 GB): 68–86 tok/s decode (with chain-of-thought verbosity)
  • Q5_K_M (22.9 GB): 56–70 tok/s
  • Q8_0 (35 GB): partial offload, 18–24 tok/s

Should you run this locally?

Yes, for 24 GB GPU owners who want strong reasoning at full-GPU speed. Best speed-quality tradeoff in the reasoning space. No, for users who can accept 70B's offload speed — pick R1 Distill Llama 70B for higher reasoning ceiling.

How it compares

  • vs DeepSeek R1 Distill Llama 70B → 70B is smarter; this 32B is much faster (full GPU). Pick by speed-vs-quality.
  • vs QwQ 32B → similar size, R1 Distill wins on hardest reasoning; QwQ has slightly cleaner everyday traces. R1 Distill is the stronger pick for math/code planning.
  • vs Qwen 3 32B with thinking mode → Qwen 3 32B is more flexible (thinking toggle); R1 Distill has more aggressive reasoning training. Coin flip.

Run this yourself

ollama pull deepseek-r1:32b-qwen-distill-q4_K_M
ollama run deepseek-r1:32b-qwen-distill-q4_K_M

Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090

Why this rating

8.8/10: the right reasoning model for users who want full-GPU inference on a 24 GB card. It distills R1 reasoning into the Qwen 2.5 32B body and fits in 19 GB at Q4, so no partial offload to system RAM is required. It loses fractional points to the 70B distill on absolute reasoning quality.

Featured in these stacks

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Workstation tier·Role: Primary reasoning model
    Build a local reasoning-model stack (May 2026)

    DeepSeek R1 Distill Qwen 32B is the reasoning model that actually runs at 24GB VRAM via AWQ-INT4. Stronger reasoning quality per parameter than the full DeepSeek R1 (which needs ~700GB and is impossible locally). Distill gives ~80% of R1's reasoning at 5% of the VRAM.

  • Stack · L3·Production tier·Role: 32B reasoning model
    Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink

    32B class on dual-4090 leaves substantial headroom: 64K context fits comfortably. R1 distill brings explicit thinking-mode emission for hard reasoning tasks. This is the canonical L1.25-enriched workstation reasoning pick.

Execution notes

L1.25 enriched

Operator notes

DeepSeek R1 Distill Qwen 32B is the canonical local-reasoning-model deployment in May 2026. It captures ~80% of full DeepSeek R1's reasoning quality in 5% of the VRAM. MIT-licensed (the Qwen 2.5 32B base is Apache 2.0). The /stacks/local-reasoning-model recipe is built around this configuration.

The honest framing: full DeepSeek R1 isn't realistically deployable locally (~700GB of weights at its native precision, and still far beyond consumer hardware when quantized). The distill is what makes serious reasoning workloads viable on consumer hardware. It is distilled into a Qwen 2.5 32B base and preserves R1's reasoning-token emission discipline while fitting RTX 4090 / 5090 / M3 Max class hardware.

Deployment notes

The /stacks/local-reasoning-model recipe pairs this model with vLLM on an RTX 4090 24GB using AWQ-INT4, 32K context, and chunked prefill. Decode runs at ~32 tok/s, and a typical reasoning query emits 1500-3000 thinking tokens before the answer starts. End-to-end, that is 50-90 seconds of wall-clock time per query: slower than chat models, but much higher quality on math / code / analysis.
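
The wall-clock range follows mostly from the decode rate and the length of the thinking block. A minimal sketch of that arithmetic, assuming latency is dominated by decode (prefill and scheduling overhead are ignored); the function name and the `answer_tokens` parameter are illustrative, not part of any runtime:

```python
def estimate_wall_clock_seconds(
    thinking_tokens: int, answer_tokens: int, decode_tps: float
) -> float:
    """Rough decode-only latency: generated tokens divided by decode rate.

    Assumption: prefill time and server overhead are ignored.
    """
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return (thinking_tokens + answer_tokens) / decode_tps


DECODE_TPS = 32.0  # the ~32 tok/s vLLM figure quoted in this section

for thinking in (1500, 3000):
    seconds = estimate_wall_clock_seconds(
        thinking, answer_tokens=0, decode_tps=DECODE_TPS
    )
    print(f"{thinking} thinking tokens -> {seconds:.0f} s")
# 1500 thinking tokens -> 47 s
# 3000 thinking tokens -> 94 s
```

The thinking block alone accounts for roughly 47 to 94 seconds at this rate, which brackets the 50-90 second figure; the answer tokens and prefill add on top.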

For 16GB VRAM, drop to DeepSeek R1 Distill Qwen 14B — meaningfully less reasoning depth but viable.

For Apple Silicon, MLX-LM on M3 Max 64GB / M4 Max handles the 32B distill comfortably; expect ~24-28 tok/s decode (vs RTX 4090's ~32).

For multi-user serving, prefer SGLang over vLLM: RadixAttention's prefix-cache wins compound on reasoning workloads where the system prompt is stable across queries.

Runtime compatibility

  • vLLM ✓ excellent. AWQ-INT4 + chunked prefill; the production-default path. Use `--enforce-eager` to avoid CUDA graph compilation issues some R1 distill versions trigger.
  • SGLang ✓ excellent. Same quant; stronger on multi-user concurrency where the reasoning system prompt is shared across sessions.
  • Ollama ✓ good. Q4_K_M GGUF available; loses concurrency benefits but wins on single-user setup time.
  • MLX-LM ✓ good. Apple Silicon path with MLX-4bit quant.
  • TensorRT-LLM ✗ not recommended. Recompile-per-config friction kills agent-loop iteration speed.

Quantization suitability

AWQ-INT4 fits in 22GB with reasoning-block headroom on a 24GB card. Drop `--gpu-memory-utilization` to 0.85 (the default is 0.9): KV cache pressure is higher than on non-reasoning 32B models because reasoning queries emit 1500-3000+ tokens of thinking blocks before the answer. Q3-class quantization loses ~6% on math benchmarks, which is meaningful enough that we don't recommend it.

Best use cases

  • Single-machine reasoning: math, multi-step analysis, complex code synthesis. The 32B distill hits the sweet spot of "frontier-quality reasoning on consumer hardware."
  • Agent planning steps — the Anthropic-style "thinking → planning → execution" pattern pairs naturally; use this model for the planning step and a faster non-reasoning model for the execution loop.
  • Math / scientific computing: verified accuracy on advanced math benchmarks rivals closed-source models.
  • Research / academic reasoning: the MIT license is clean for any deployment.

When to use a different model

Failure modes specific to this model

  1. Reasoning blocks leak into structured output. If the model is instructed to emit JSON, the `<think>...</think>` block can break the parse. Strip thinking tokens or instruct the model to skip reasoning for structured queries.
  2. Sampler config sensitivity. Reasoning models are more sensitive to sampler parameters than chat models. Use temperature 0.5-0.7 (0.6 is the model card's recommended value); the chat default of 1.0 produces meaningfully worse reasoning.
  3. Premature stopping on EOS during reasoning. Some configs treat `</think>` as a stop token. Verify that the stop-token list excludes the reasoning-block delimiters.
  4. Token-cost runaway. A reasoning chain of 5000 tokens per step on a 4-step agent loop costs 20K tokens before the agent does anything else. Set per-task token budgets.
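
Failure modes 1 and 4 can both be handled on the client side. The sketch below is one possible approach, not a runtime feature: it assumes the reasoning is wrapped in `<think>...</think>` tags, and the helper names (`strip_thinking`, `parse_json_answer`, `loop_token_cost`) are hypothetical. It also assumes, as a precaution, that some outputs may contain only a closing tag.

```python
import json
import re
from typing import Any

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove reasoning blocks so only the final answer remains."""
    cleaned = THINK_BLOCK.sub("", text)
    # Assumption: if only a closing tag survives (opening tag missing from
    # the captured output), keep whatever follows the last closing tag.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[-1]
    return cleaned.strip()


def parse_json_answer(raw_output: str) -> Any:
    """Strip reasoning first, then parse the remainder as JSON."""
    return json.loads(strip_thinking(raw_output))


def loop_token_cost(steps: int, reasoning_tokens_per_step: int) -> int:
    """Reasoning tokens spent across an agent loop before any other work."""
    return steps * reasoning_tokens_per_step


raw = '<think>\nThe user wants a JSON object.\n</think>\n{"answer": 42}'
print(parse_json_answer(raw))    # {'answer': 42}
print(loop_token_cost(4, 5000))  # 20000
```

Comparing `loop_token_cost` against a per-task budget before starting a loop is a simple way to catch runaway cost early.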

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Strengths

  • MIT
  • Single-24GB-card reasoner

Weaknesses

  • Verbose CoT inflates output cost

Prompting kit

From model card
source

Tested patterns for getting the most out of DeepSeek R1 Distill Qwen 32B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Quirks to know

  • Distilled from DeepSeek-R1 into Qwen2.5 32B. Retains R1's <think>...</think> reasoning behavior per the model card.
  • No system prompt — same DeepSeek rule. Put instructions in the user message.
  • Uses Qwen's ChatML chat template (NOT DeepSeek's pipe markers) because this is a Qwen2.5 fine-tune.
  • Per the model card, this is the recommended R1 distill for 24-32GB-VRAM rigs — it nearly matches R1-Distill-Llama-70B at half the VRAM footprint.
  • Tool calling is not officially supported (same as the Llama distill). The distillation overwrites the base model's tool-call tuning.
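
The "no system prompt" quirk can be handled when the request is assembled. A minimal sketch, assuming a chat API that accepts a list of role/content messages; `build_messages` is a hypothetical helper, not part of any library:

```python
def build_messages(instructions: str, question: str) -> list[dict[str, str]]:
    """Fold instructions into a single user turn instead of a system role."""
    instructions = instructions.strip()
    question = question.strip()
    # Put the instructions first, separated from the question by a blank line.
    content = f"{instructions}\n\n{question}" if instructions else question
    return [{"role": "user", "content": content}]


messages = build_messages(
    "Answer in one sentence.",
    "What is 17 * 23?",
)
print(messages)
# [{'role': 'user', 'content': 'Answer in one sentence.\n\nWhat is 17 * 23?'}]
```

The returned list contains no `system` entry, so the runtime's chat template only ever sees a user turn.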

Chat template

ChatML (Qwen2.5)

<|im_start|>{role}\n{content}<|im_end|>. Same template as base Qwen2.5 32B. Ships in tokenizer_config.json.

Tool calling

✗ Not supported

Tool calling unreliable post-distillation per the model card. For tool use on Qwen, fall back to base Qwen2.5 32B Instruct or Qwen 3 32B.

Sampler settings

temperature
0.6
top_p
0.95

Per the model card, recommended sampling is in the 0.5-0.7 range. Same as base R1.
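
These settings can be pinned in code so a chat-style default does not slip in. The sketch below is illustrative: `sampler_settings` is a hypothetical helper, rejecting out-of-range values is a design choice rather than a requirement, and the `temperature` / `top_p` keys assume an OpenAI-compatible request body.

```python
RECOMMENDED_TEMPERATURE_RANGE = (0.5, 0.7)  # range quoted from the model card


def sampler_settings(temperature: float = 0.6, top_p: float = 0.95) -> dict[str, float]:
    """Return sampler parameters, rejecting values outside the quoted range."""
    low, high = RECOMMENDED_TEMPERATURE_RANGE
    if not low <= temperature <= high:
        raise ValueError(f"temperature {temperature} is outside {low}-{high}")
    if not 0.0 < top_p <= 1.0:
        raise ValueError(f"top_p {top_p} must be in (0, 1]")
    return {"temperature": temperature, "top_p": top_p}


# Assumed request shape for an OpenAI-compatible chat endpoint.
payload = {
    "model": "deepseek-r1:32b",
    "messages": [{"role": "user", "content": "Prove that 2 + 2 = 4."}],
    **sampler_settings(),
}
print(payload["temperature"], payload["top_p"])  # 0.6 0.95
```

Calling `sampler_settings(temperature=1.0)` raises a `ValueError`, which surfaces a chat-default temperature before it reaches the model.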

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          19.0 GB      24 GB
Q8_0            34.0 GB      40 GB
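
The table can be read as a lookup: take the highest-quality variant whose VRAM requirement fits the card. A small sketch using only the two rows above; `best_quant_for` is a hypothetical helper, and it treats a larger VRAM requirement as a proxy for higher quality:

```python
from typing import Optional

# VRAM requirement in GB per quantization, copied from the table above.
QUANT_VRAM_GB = {"Q4_K_M": 24, "Q8_0": 40}


def best_quant_for(vram_gb: float) -> Optional[str]:
    """Return the most demanding listed quantization that still fits."""
    # Sort from largest requirement to smallest and take the first that fits.
    for name, required in sorted(
        QUANT_VRAM_GB.items(), key=lambda item: item[1], reverse=True
    ):
        if required <= vram_gb:
            return name
    return None  # nothing in the table fits this card


print(best_quant_for(24))  # Q4_K_M
print(best_quant_for(48))  # Q8_0
print(best_quant_for(16))  # None
```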

Get the model

Ollama

One-line install

ollama run deepseek-r1:32b

HuggingFace

Original weights

huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

Source repository with the original, unquantized weights. You need to quantize them yourself before running on a 24 GB card.

Frequently asked

What's the minimum VRAM to run DeepSeek R1 Distill Qwen 32B?

24GB of VRAM is enough to run DeepSeek R1 Distill Qwen 32B at the Q4_K_M quantization (file size 19.0 GB). Higher-quality quantizations need more.

Can I use DeepSeek R1 Distill Qwen 32B commercially?

Yes. DeepSeek R1 Distill Qwen 32B ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of DeepSeek R1 Distill Qwen 32B?

DeepSeek R1 Distill Qwen 32B supports a context window of 131,072 tokens (128K).

How do I install DeepSeek R1 Distill Qwen 32B with Ollama?

Run `ollama pull deepseek-r1:32b` to download, then `ollama run deepseek-r1:32b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
