Qwen 3 32B
Dense Qwen 3 32B. Best dense open-weight model in its size class at release; pairs nicely with a single RTX 5090 or 4090.
Positioning
The new daily driver for RTX 3090 / 4090 owners, with extra headroom on a 5090. Same VRAM footprint as Qwen 2.5 32B, materially better on reasoning thanks to thinking mode, similar speed in non-thinking. The right answer to "what runs on my 24 GB GPU?" today.
Strengths
- 19 GB at Q4_K_M — full GPU offload on 24 GB with 16K context.
- Hybrid reasoning lifts hard-task quality past Qwen 2.5 32B without VRAM cost.
- Multilingual carryover still strong.
Limitations
- Thinking-mode tokens cost real time — verbose intermediate reasoning eats throughput.
- License terms are unchanged from Qwen 2.5 32B: Apache 2.0, so no usage caps apply.
- Qwen 2.5 Coder 32B still beats it for coding — coder is a dedicated specialist.
Real-world performance on RTX 4090
- Q4_K_M (19.4 GB): 68–86 tok/s decode (non-thinking); same speed thinking, more tokens emitted
- Q5_K_M (22.9 GB): 56–70 tok/s
- Q8_0 (35 GB): partial offload, 18–24 tok/s
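To put the thinking-mode cost from the Limitations list into numbers, the sketch below converts a token count into wall-clock decode time using the Q4_K_M range quoted above. The 2,000-token reasoning length is an illustrative assumption borrowed from the Failure modes section, `decode_seconds` and `thinking_overhead` are hypothetical helper names, and prefill time is ignored.

```python
def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to decode `tokens` at a steady `tok_per_s`."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return tokens / tok_per_s


# Q4_K_M decode range on an RTX 4090, as quoted in the list above.
Q4_RANGE = (68.0, 86.0)


def thinking_overhead(reasoning_tokens: int, rate_range=Q4_RANGE):
    """Return (best_case_s, worst_case_s) spent on reasoning tokens alone."""
    slow, fast = rate_range
    return (
        decode_seconds(reasoning_tokens, fast),
        decode_seconds(reasoning_tokens, slow),
    )


best, worst = thinking_overhead(2000)  # assumed reasoning length
print(f"2000 reasoning tokens: {best:.1f} to {worst:.1f} s before the answer starts")
```

Because decode speed is the same in both modes, the delay scales linearly with however many reasoning tokens the model chooses to emit.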
Should you run this locally?
Yes, for 24 GB single-card owners who want the strongest dense model with hybrid reasoning. The new default daily driver. No, for dedicated coding workflows (pick Qwen 2.5 Coder 32B), or hard reasoning where QwQ 32B's specialization wins.
How it compares
- vs Qwen 2.5 32B Instruct → Qwen 3 32B wins outright at the same VRAM. New work should default to Qwen 3.
- vs QwQ 32B → QwQ is the reasoning specialist; Qwen 3 32B is the generalist with optional reasoning. Pick QwQ for math/code reasoning, Qwen 3 32B for general chat.
- vs Llama 3.3 70B → Llama 3.3 70B is smarter but 3× slower on the same hardware. Qwen 3 32B is the productivity pick.
- vs Qwen 3 30B-A3B (MoE) → 30B-A3B is faster (~2× tok/s) due to MoE; Qwen 3 32B dense is steadier on instruction following.
Run this yourself
ollama pull qwen3:32b
ollama run qwen3:32b
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090
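Once the model is pulled, it can also be driven from a script through Ollama's local HTTP API. The following is a minimal sketch, assuming the server is listening on Ollama's default port 11434; `build_chat_request` and `chat` are hypothetical helper names, and the `num_ctx` option mirrors the 16384 ctx setting above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint


def build_chat_request(prompt: str, num_ctx: int = 16384) -> dict:
    """Build a non-streaming /api/chat payload for qwen3:32b."""
    return {
        "model": "qwen3:32b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window requested from the runtime
    }


def chat(prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running, `chat("Explain KV cache in two sentences.")` should return the assistant's reply as a string.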
Why this rating
8.9/10 — the 32B-class evolution of the Qwen 3 thinking-mode story. Stronger absolute capability than Qwen 2.5 32B, runs in the same VRAM. Replaces 2.5 32B as the default for 24 GB single-card daily-driver use.
Featured in this stack
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3 · Workstation tier · Role: General model with reasoning toggle · Build a local reasoning-model stack (May 2026)
Qwen 3 32B has a reasoning-mode toggle (the <think> block convention) that you can enable per-query. Useful when most of your workload doesn't need reasoning — fall back to standard mode for chat, enable thinking for math / code / analysis.
Execution notes
Operator notes
Qwen 3 32B is the reasoning-toggle generation of the Qwen family. The architectural shift from Qwen 2.5: native `<think>` reasoning blocks that toggle per-query. Strong reasoning when enabled (roughly comparable to DeepSeek R1 Distill Qwen 32B); fast chat when disabled (no reasoning-token tax). Apache 2.0.
The right pick when your workload mix is mostly chat with occasional reasoning needs — you don't pay the reasoning-block cost on simple queries.
Deployment notes
Production: vLLM + RTX 4090 24 GB + AWQ-INT4 quant + 32K context. Set `gpu-memory-utilization` to 0.85 (not the default 0.9) — reasoning-block emission pushes KV cache pressure higher than non-reasoning 32B models. The /stacks/local-reasoning-model recipe pairs this configuration with Open WebUI's reasoning-block rendering.
Workstation: 5090 32 GB or M4 Max 64 GB unified memory both fit comfortably with full headroom for concurrent users.
Multi-user: SGLang over vLLM if reasoning-mode is the dominant workload — RadixAttention's prefix-cache wins compound across reasoning queries with shared system prompts.
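The production settings above can be collected into a single launch command. This is a sketch rather than a tested recipe: the checkpoint name `Qwen/Qwen3-32B-AWQ` is an assumption (substitute whichever AWQ repository or local path you actually use), and `vllm_serve_argv` is a hypothetical helper that only assembles the argument list.

```python
import shlex


def vllm_serve_argv(
    model: str = "Qwen/Qwen3-32B-AWQ",  # assumed checkpoint name, not confirmed by this page
    max_model_len: int = 32768,
    gpu_memory_utilization: float = 0.85,
) -> list[str]:
    """Assemble a `vllm serve` command line from the deployment notes."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--quantization", "awq",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-chunked-prefill",
    ]


# Print a copy-pasteable shell command.
print(shlex.join(vllm_serve_argv()))
```

Keeping the values as function parameters makes it easy to lower `max_model_len` when the card runs out of memory, which is the first knob the Failure modes section suggests.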
Runtime compatibility
- vLLM ✓ excellent. AWQ-INT4 supported; --enable-chunked-prefill non-optional for reasoning queries.
- SGLang ✓ excellent. RadixAttention pairs naturally with reasoning workloads.
- Ollama ✓ good. Q4_K_M GGUF available; loses concurrency benefits.
- MLX-LM ✓ good. Apple Silicon path; 32B in MLX-4bit fits 64 GB unified memory.
Quantization suitability
AWQ-INT4 is the production-recommended quant. KV cache pressure with reasoning-mode is higher than non-reasoning 32B — drop `gpu-memory-utilization` to 0.85 to leave headroom for reasoning-block emissions. Q4_K_M GGUF for the Ollama path; same caveats apply.
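The KV cache pressure described above can be estimated with simple arithmetic. The sketch below assumes the attention shape of the 32B checkpoint is 64 layers, 8 key/value heads, and a head dimension of 128, with a 16-bit cache; these values are assumptions to verify against the `config.json` shipped with your weights, and the estimate ignores allocator overhead and activations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    layers: int = 64          # assumed num_hidden_layers
    kv_heads: int = 8         # assumed num_key_value_heads (grouped-query attention)
    head_dim: int = 128       # assumed head_dim
    bytes_per_elem: int = 2   # FP16/BF16 cache entries


def kv_bytes_per_token(shape: AttnShape = AttnShape()) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * shape.bytes_per_elem


def kv_cache_gib(tokens: int, shape: AttnShape = AttnShape()) -> float:
    """KV cache size in GiB for a single sequence of `tokens` tokens."""
    return tokens * kv_bytes_per_token(shape) / 2**30


for ctx in (16384, 32768, 32768 + 2000):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.2f} GiB of KV cache")
```

Under these assumptions each token costs 256 KiB of cache, so a single 32K-token sequence needs 8 GiB on top of the quantized weights, which is why the headroom advice matters on a 24 GB card.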
When to use a different model
- Coding-first: Qwen 2.5 Coder 32B, the previous generation's coding-specialized fine-tune.
- Pure reasoning (no toggle needed): DeepSeek R1 Distill Qwen 32B — always-on reasoning, slightly stronger on math benchmarks.
- 16 GB VRAM tier: Qwen 3 14B — same family, reasoning toggle at smaller scale.
- Frontier-tier: cluster-deploy DeepSeek V4 — May 2026 open-weight benchmark leader.
Best use cases
- Mixed chat + reasoning workloads — toggle provides the right operating point for each query.
- Agent loops with selective reasoning — invoke reasoning-mode for plan-generation steps; standard mode for tool-call iterations.
- Multilingual workflows — Qwen family's CJK depth carries through; better than Llama on Chinese / Japanese.
- Apache 2.0 license required — drops in cleanly for commercial deployments without license review.
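The agent-loop pattern in the list above (reasoning for planning steps, standard mode for tool-call iterations) can be paired with a running token budget. The following is a minimal sketch; the step kinds, the 2048 and 512 token caps, and the names `TokenBudget` and `step_plan` are all illustrative assumptions rather than tested values.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    limit: int
    spent: int = 0

    def remaining(self) -> int:
        return max(self.limit - self.spent, 0)

    def charge(self, tokens: int) -> None:
        """Record tokens actually emitted by a step."""
        self.spent += tokens


def step_plan(kind: str, budget: TokenBudget,
              think_cap: int = 2048, plain_cap: int = 512) -> dict:
    """Decide whether a step may use reasoning mode and how many tokens it may emit."""
    left = budget.remaining()
    # Only planning steps think, and only if a full reasoning block still fits.
    thinking = kind == "plan" and left >= think_cap
    cap = think_cap if thinking else plain_cap
    return {"thinking": thinking, "max_tokens": min(cap, left)}


budget = TokenBudget(limit=3000)
print(step_plan("plan", budget))       # first plan step gets reasoning mode
budget.charge(2048)
print(step_plan("tool_call", budget))  # tool-call iterations stay in standard mode
```

Falling back to standard mode when the budget runs low keeps a long loop from stalling on one oversized reasoning block.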
Failure modes
- Reasoning-block emission inside structured output. If the model is instructed to emit JSON but reasoning-mode is enabled, the thinking block can leak into the JSON output. Disable reasoning-mode for structured-output workflows.
- Token-cost runaway on reasoning chains. Reasoning blocks can emit 2000+ tokens; multiply by the agent loop's tool-call count and the cost compounds. Set per-query token budgets.
- KV cache OOM on long reasoning + long context. 32K context + 2000-token reasoning + system prompt + tool schemas adds up to roughly 25 GB of total memory demand (weights plus KV cache) on a 24 GB card. Lower max_model_len or switch off reasoning for long-context tasks.
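For the structured-output failure mode above, a defensive option when reasoning cannot be disabled is to separate the thinking block from the answer before parsing. This sketch assumes the reasoning arrives wrapped in literal `<think>` and `</think>` tags within the response text; `split_reasoning` and `parse_json_answer` are hypothetical helper names.

```python
import json
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    reasoning = "\n".join(part.strip() for part in _THINK_BLOCK.findall(text))
    answer = _THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


def parse_json_answer(text: str):
    """Drop any thinking block, then parse what remains as JSON."""
    _, answer = split_reasoning(text)
    return json.loads(answer)  # raises json.JSONDecodeError if the answer is not JSON


raw = '<think>\nThe user wants a status object.\n</think>\n\n{"ok": true}'
print(parse_json_answer(raw))
```

Disabling reasoning for structured-output calls remains the simpler fix; this filter is a safety net for mixed pipelines.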
Going deeper
- /stacks/local-reasoning-model — the canonical reasoning deployment recipe
- /systems/agent-execution-systems — when reasoning helps agent workflows
- Qwen 3 14B, Qwen 3 8B, Qwen 2.5 72B Instruct: Qwen family siblings
- DeepSeek R1 Distill Qwen 32B — the always-on-reasoning alternative
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Strongest dense ~30B model
- Apache 2.0
- Tool calling
Weaknesses
- Needs 24GB+ VRAM
Prompting kit
Tested patterns for getting the most out of Qwen 3 32B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Recommended system prompt
You are Qwen, a helpful assistant created by Alibaba Cloud. Answer the user's question directly and concisely. When the task requires step-by-step analysis, work through it carefully before giving the final answer.
Quirks to know
- Supports a 'thinking mode' switch: append /think to enable visible chain-of-thought, /no_think to disable. Per the model card, /no_think is recommended for short Q&A; /think is recommended for math, code, and multi-step reasoning.
- Native 32K context window. Per the model card, contexts up to 131K are reachable with YaRN scaling; set rope_scaling factor to 4.0 in your runtime config.
- Hybrid reasoning: the same checkpoint handles both fast chat and deep reasoning depending on the /think toggle. No separate model required.
- Uses ChatML format with <|im_start|> / <|im_end|> role tokens. Confirm your runtime's chat template matches the one shipped in tokenizer_config.json.
- Multilingual: officially supports 119 languages per the model card. Quality stays high in CJK languages; African and lower-resource languages may degrade.
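Two of the quirks above translate directly into small helpers: appending the /think or /no_think switch to a user turn, and deriving the YaRN factor for an extended context. This is a sketch under assumptions: the `rope_scaling` key names follow the convention used in Hugging Face style configs as best recalled here, so check them against your runtime's documentation, and both function names are hypothetical.

```python
NATIVE_CTX = 32768  # native context window quoted above


def with_thinking_switch(user_message: str, think: bool) -> str:
    """Append the /think or /no_think soft switch to a user turn."""
    switch = "/think" if think else "/no_think"
    return f"{user_message.rstrip()} {switch}"


def yarn_rope_scaling(target_ctx: int, native_ctx: int = NATIVE_CTX) -> dict:
    """Build a rope_scaling block that stretches native_ctx to target_ctx."""
    factor = target_ctx / native_ctx
    if factor <= 1.0:
        raise ValueError("target fits in the native window; no scaling needed")
    return {
        "rope_type": "yarn",  # assumed key name
        "factor": factor,
        "original_max_position_embeddings": native_ctx,  # assumed key name
    }


print(with_thinking_switch("Prove that 17 is prime.", think=True))
print(yarn_rope_scaling(131072))
```

A 131,072-token target divided by the 32,768-token native window gives the factor of 4.0 mentioned in the list.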
Chat template
<|im_start|>{role}\n{content}<|im_end|>. The template ships in tokenizer_config.json — apply it via the runtime rather than hand-rolling, since the thinking-mode toggle inserts an extra system marker.
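For debugging, it can help to see what the basic ChatML framing looks like so you can compare it with what your runtime produces. The sketch below is for inspection only and is not the shipped template: it omits the thinking-mode handling and tool sections, and `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages in plain ChatML framing (no thinking or tool handling)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are Qwen, a helpful assistant."},
    {"role": "user", "content": "What is a KV cache?"},
]))
```

If the runtime's rendered prompt differs from this in more than the thinking-mode marker, the chat template is likely misconfigured.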
Tool calling
Per the model card, Qwen3 uses Hermes-style tool call format: tools declared in the system prompt, calls emitted as <tool_call>{...}</tool_call> blocks. Compatible with llama.cpp's --jinja mode and most agent frameworks.
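When the runtime does not parse tool calls for you, the `<tool_call>` blocks can be pulled out of the raw text. This is a minimal sketch, assuming each block contains a single JSON object with `name` and `arguments` keys (the usual Hermes-style shape); `extract_tool_calls` and the `get_weather` tool are hypothetical.

```python
import json
import re

_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed JSON tool call found in the model output."""
    calls = []
    for raw in _TOOL_CALL.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than crash the agent loop
    return calls


output = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Oslo"}}\n</tool_call>'
)
print(extract_tool_calls(output))
```

Silently skipping malformed blocks is a design choice for illustration; a production loop may prefer to surface the error back to the model.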
Sampler settings
- temperature: 0.7
- top_p: 0.8
- top_k: 20
Vendor-recommended defaults from the Qwen3 model card. For /think mode, the card recommends temperature 0.6 and top_p 0.95 instead — switch sampler when reasoning.
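Since the recommended sampler differs between modes, it is convenient to keep both presets in one place and select by mode. In this sketch the thinking preset carries `top_k` over unchanged, which is an assumption because the text above only names temperature and top_p for /think; `sampler_for` is a hypothetical helper.

```python
SAMPLERS = {
    "no_think": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    # top_k kept at 20 for thinking mode: an assumption, not stated above.
    "think": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
}


def sampler_for(think: bool) -> dict:
    """Return a copy of the sampler preset for the chosen mode."""
    return dict(SAMPLERS["think" if think else "no_think"])


print(sampler_for(think=True))
print(sampler_for(think=False))
```

Returning a copy lets callers tweak a value per request without mutating the shared presets.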
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 19.0 GB | 24 GB |
| Q5_K_M | 22.0 GB | 28 GB |
| Q8_0 | 34.0 GB | 40 GB |
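The table can be turned into a quick lookup that picks the largest quantization fitting a given card. The values are copied from the table above; `best_quant` is a hypothetical helper and treats the "VRAM required" column as a hard threshold for full GPU offload.

```python
# (name, file size in GB, VRAM required in GB), copied from the table above.
QUANTS = [
    ("Q4_K_M", 19.0, 24),
    ("Q5_K_M", 22.0, 28),
    ("Q8_0", 34.0, 40),
]


def best_quant(vram_gb: float):
    """Return the largest quantization whose VRAM requirement fits, or None."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]


for card_gb in (16, 24, 32, 48):
    print(f"{card_gb} GB -> {best_quant(card_gb)}")
```

Partial offload, as in the Q8_0 measurement earlier on this page, is outside what this lookup models.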
Get the model
Ollama
One-line install
ollama run qwen3:32b
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
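What's the minimum VRAM to run Qwen 3 32B?
24 GB for the Q4_K_M quantization (19.0 GB file), per the quantization table above.
Can I use Qwen 3 32B commercially?
Yes. The weights are released under Apache 2.0.
What's the context length of Qwen 3 32B?
32K tokens natively; per the model card, up to 131K with YaRN scaling.
How do I install Qwen 3 32B with Ollama?
Run ollama pull qwen3:32b, then ollama run qwen3:32b.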
Source: huggingface.co/Qwen/Qwen3-32B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.