qwen

32B parameters

Commercial OK

Reviewed June 2026

Qwen 3 32B

Dense Qwen 3 32B. Best dense open-weight model in its size class at release; pairs nicely with a single RTX 5090 or 4090.

License: Apache 2.0 · Released Apr 29, 2025 · Context: 131,072 tokens

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

8.9/10

Positioning

The new daily driver for RTX 3090 / 4090 / 5090 owners. It has the same VRAM footprint as Qwen 2.5 32B, reasons materially better thanks to thinking mode, and runs at similar speed in non-thinking mode. The right answer to "what runs on my 24 GB GPU?" today.

Strengths

19 GB at Q4_K_M — full GPU offload on 24 GB with 16K context.
Hybrid reasoning lifts hard-task quality past Qwen 2.5 32B without VRAM cost.
Multilingual carryover still strong.

Limitations

Thinking-mode tokens cost real time — verbose intermediate reasoning eats throughput.
License caps as before.
Qwen 2.5 Coder 32B still beats it for coding — coder is a dedicated specialist.

Real-world performance on RTX 4090

Q4_K_M (19.4 GB): 68–86 tok/s decode in non-thinking mode; thinking mode decodes at the same rate but emits more tokens per answer
Q5_K_M (22.9 GB): 56–70 tok/s
Q8_0 (35 GB): partial offload, 18–24 tok/s
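To put "same rate, more tokens" in wall-clock terms, here is a rough sketch using the Q4_K_M decode range quoted above. The 300-token answer and 2,000-token reasoning block are illustrative assumptions, not measurements, and prompt prefill time is ignored.

```python
# Rough decode-time estimate; token counts below are assumed, not measured.
Q4_K_M_TOK_PER_S = (68.0, 86.0)  # decode range quoted for the RTX 4090


def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Wall-clock decode time for `tokens` output tokens, ignoring prefill."""
    return tokens / tok_per_s


def latency_range(total_tokens: int, speeds=Q4_K_M_TOK_PER_S):
    """Return (best_case_s, worst_case_s) across the quoted speed range."""
    slow, fast = speeds
    return decode_seconds(total_tokens, fast), decode_seconds(total_tokens, slow)


ANSWER_TOKENS = 300    # hypothetical final answer length
THINK_TOKENS = 2000    # hypothetical reasoning-block length

no_think = latency_range(ANSWER_TOKENS)
with_think = latency_range(ANSWER_TOKENS + THINK_TOKENS)
print(f"non-thinking: {no_think[0]:.1f}-{no_think[1]:.1f} s")
print(f"thinking:     {with_think[0]:.1f}-{with_think[1]:.1f} s")
```

Under these assumptions a reply that takes a few seconds in non-thinking mode takes roughly half a minute with a long reasoning block, even though tok/s is unchanged.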

Should you run this locally?

Yes, for 24 GB single-card owners who want the strongest dense model with hybrid reasoning. The new default daily driver. No, for dedicated coding workflows (pick Qwen 2.5 Coder 32B), or hard reasoning where QwQ 32B's specialization wins.

How it compares

vs Qwen 2.5 32B Instruct → Qwen 3 32B wins outright at the same VRAM. New work should default to Qwen 3.
vs QwQ 32B → QwQ is the reasoning specialist; Qwen 3 32B is the generalist with optional reasoning. Pick QwQ for math/code reasoning, Qwen 3 32B for general chat.
vs Llama 3.3 70B → Llama 3.3 70B is smarter but 3× slower on the same hardware. Qwen 3 32B is the productivity pick.
vs Qwen 3 30B-A3B (MoE) → 30B-A3B is faster (~2× tok/s) due to MoE; Qwen 3 32B dense is steadier on instruction following.

Run this yourself

ollama pull qwen3:32b
ollama run qwen3:32b

Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090
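The 16384-token context in these settings usually has to be requested explicitly, since Ollama's default context window is typically smaller. A minimal sketch of doing that through Ollama's local REST API, assuming the server is running on its default port; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint


def build_chat_request(prompt: str, num_ctx: int = 16384) -> dict:
    """Build a non-streaming chat payload that pins the context size."""
    return {
        "model": "qwen3:32b",
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,
    }


def chat(prompt: str) -> str:
    """Send one prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Calling chat("Summarize the Apache 2.0 license in two sentences.") requires the model to have been pulled first with the commands above.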

Why this rating

8.9/10 — the 32B-class evolution of the Qwen 3 thinking-mode story. Stronger absolute capability than Qwen 2.5 32B, runs in the same VRAM. Replaces 2.5 32B as the default for 24 GB single-card daily-driver use.

Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3 · Workstation tier · Role: General model with reasoning toggle
Build a local reasoning-model stack (May 2026)
Qwen 3 32B has a reasoning-mode toggle (the <think> block convention) that you can enable per-query. Useful when most of your workload doesn't need reasoning — fall back to standard mode for chat, enable thinking for math / code / analysis.

Execution notes

Operator notes

Qwen 3 32B is the reasoning-toggle generation of the Qwen family. The architectural shift from Qwen 2.5: native `<think>` reasoning blocks that toggle per-query. Strong reasoning when enabled (roughly comparable to DeepSeek R1 Distill Qwen 32B); fast chat when disabled (no reasoning-token tax). Apache 2.0.

The right pick when your workload mix is mostly chat with occasional reasoning needs — you don't pay the reasoning-block cost on simple queries.

Deployment notes

Production: vLLM + RTX 4090 24 GB + AWQ-INT4 quant + 32K context. Set `gpu-memory-utilization` to 0.85 (not the default 0.9) — reasoning-block emission pushes KV cache pressure higher than non-reasoning 32B models. The /stacks/local-reasoning-model recipe pairs this configuration with Open WebUI's reasoning-block rendering.
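The production configuration above might translate into a launch command along these lines. This is a sketch, not a tested recipe: the model repository id `Qwen/Qwen3-32B-AWQ` is an assumption (substitute whichever AWQ checkpoint you actually use), and `--enable-chunked-prefill` is included because the runtime compatibility notes call for it.

```python
import shlex


def vllm_serve_args(
    model: str = "Qwen/Qwen3-32B-AWQ",  # assumed repo id, not confirmed by this page
    max_model_len: int = 32768,
    gpu_memory_utilization: float = 0.85,
) -> list:
    """Assemble a `vllm serve` command line for the 24 GB production profile."""
    return [
        "vllm", "serve", model,
        "--quantization", "awq",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-chunked-prefill",
    ]


# Print a copy-pasteable shell command.
print(shlex.join(vllm_serve_args()))
```

Building the argument list in code keeps the 0.85 utilization setting and the context length in one place if you later need to lower `max_model_len` for long-context work.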

Workstation: 5090 32 GB or M4 Max 64 GB unified memory both fit comfortably with full headroom for concurrent users.

Multi-user: SGLang over vLLM if reasoning-mode is the dominant workload — RadixAttention's prefix-cache wins compound across reasoning queries with shared system prompts.

Runtime compatibility

vLLM ✓ excellent. AWQ-INT4 supported; treat --enable-chunked-prefill as required for reasoning queries.
SGLang ✓ excellent. RadixAttention pairs naturally with reasoning workloads.
Ollama ✓ good. Q4_K_M GGUF available; loses concurrency benefits.
MLX-LM ✓ good. Apple Silicon path; 32B in MLX-4bit fits 64 GB unified memory.

Quantization suitability

AWQ-INT4 is the production-recommended quant. KV cache pressure with reasoning-mode is higher than non-reasoning 32B — drop `gpu-memory-utilization` to 0.85 to leave headroom for reasoning-block emissions. Q4_K_M GGUF for the Ollama path; same caveats apply.
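The KV cache pressure mentioned above can be estimated with simple arithmetic. The sketch below assumes 64 layers, 8 KV heads (grouped-query attention), a head dimension of 128, and an FP16 cache; those values are my reading of the published Qwen3-32B configuration and should be checked against the checkpoint's config.json before relying on the numbers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    tokens: int,
    layers: int = 64,        # assumed, verify in config.json
    kv_heads: int = 8,       # assumed, verify in config.json
    head_dim: int = 128,     # assumed, verify in config.json
    bytes_per_elem: int = 2, # FP16 cache
) -> int:
    """Bytes of KV cache for one sequence of `tokens` tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # keys + values
    return tokens * per_token


for ctx in (16384, 32768):
    print(f"{ctx} tokens: {kv_cache_bytes(ctx) / GIB:.1f} GiB of KV cache")
```

Under these assumptions each token costs 256 KiB of cache, so a 16K sequence needs about 4 GiB and a 32K sequence about 8 GiB on top of the quantized weights. Long reasoning blocks count toward that total just like prompt tokens.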

When to use a different model

Coding-first: Qwen 2.5 Coder 32B (previous Qwen generation, coding-specialized fine-tune).
Pure reasoning (no toggle needed): DeepSeek R1 Distill Qwen 32B — always-on reasoning, slightly stronger on math benchmarks.
16 GB VRAM tier: Qwen 3 14B — same family, reasoning toggle at smaller scale.
Frontier-tier: cluster-deploy DeepSeek V4 — May 2026 open-weight benchmark leader.

Best use cases

Mixed chat + reasoning workloads — toggle provides the right operating point for each query.
Agent loops with selective reasoning — invoke reasoning-mode for plan-generation steps; standard mode for tool-call iterations.
Multilingual workflows — Qwen family's CJK depth carries through; better than Llama on Chinese / Japanese.
Apache 2.0 license required — drops in cleanly for commercial deployments without license review.

Failure modes

Reasoning-block emission inside structured output. If the model is instructed to emit JSON but reasoning-mode is enabled, the thinking block can leak into the JSON output. Disable reasoning-mode for structured-output workflows.
Token-cost runaway on reasoning chains. Reasoning blocks can emit 2000+ tokens; multiply by the agent loop's tool-call count and the cost compounds. Set per-query token budgets.
KV cache OOM on long reasoning + long context. 32K context + 2000-token reasoning + system prompt + tool schemas = ~25 GB of total memory pressure (weights plus KV cache) on a 24 GB card. Lower max_model_len or switch off reasoning for long-context tasks.
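For the first failure mode, disabling reasoning is the primary fix. As a defensive fallback, a client can also strip any leaked reasoning block before parsing. A minimal sketch, assuming the block is delimited by `<think>` and `</think>` and is properly closed; the function names are hypothetical.

```python
import json
import re

# Non-greedy match so each <think>...</think> block is removed on its own.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove closed <think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def parse_json_reply(text: str):
    """Parse a model reply as JSON after dropping any reasoning block."""
    return json.loads(strip_reasoning(text))


reply = '<think>\nThe user wants a status flag.\n</think>\n\n{"ok": true}'
print(parse_json_reply(reply))
```

This does not handle a reasoning block that was truncated before its closing tag, which can happen when a token budget cuts generation short; in that case json.loads raises and the caller should retry with reasoning disabled.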

Going deeper

/stacks/local-reasoning-model — the canonical reasoning deployment recipe
/systems/agent-execution-systems — when reasoning helps agent workflows
Qwen 3 14B, Qwen 3 8B, Qwen 2.5 72B Instruct: other Qwen family models
DeepSeek R1 Distill Qwen 32B — the always-on-reasoning alternative

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (qwen-3)

Qwen 3 32B (32B): you are here
Qwen 3 235B-A22B (235B): frontier

Distilled / fine-tuned from this

Strengths

Strongest dense ~30B model
Apache 2.0
Tool calling

Weaknesses

Needs 24GB+ VRAM

Prompting kit

From model card

Tested patterns for getting the most out of Qwen 3 32B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are Qwen, a helpful assistant created by Alibaba Cloud. Answer the user's question directly and concisely. When the task requires step-by-step analysis, work through it carefully before giving the final answer.

Quirks to know

• Supports a 'thinking mode' switch: append /think to enable visible chain-of-thought, /no_think to disable. Per the model card, /no_think is recommended for short Q&A; /think is recommended for math, code, and multi-step reasoning.
• Native 32K context window. Per the model card, contexts up to 131K are reachable with YaRN scaling; set the rope_scaling factor to 4.0 in your runtime config.
• Hybrid reasoning: the same checkpoint handles both fast chat and deep reasoning depending on the /think toggle. No separate model required.
• Uses ChatML format with <|im_start|> / <|im_end|> role tokens. Confirm your runtime's chat template matches the one shipped in tokenizer_config.json.
• Multilingual: officially supports 119 languages per the model card. Quality stays high in CJK languages; African and lower-resource languages may degrade.
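The YaRN factor in the context-window quirk above is just the ratio of target to native context. The sketch below derives it and builds a rope_scaling entry; the key names follow the Hugging Face config convention as I understand it, so treat them as an assumption and confirm the exact form your runtime expects.

```python
NATIVE_CONTEXT = 32768  # native window stated in the quirks list


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a rope_scaling entry that stretches the native window with YaRN."""
    if target_context < native_context:
        raise ValueError("YaRN scaling is only needed beyond the native window")
    return {
        "rope_type": "yarn",  # assumed key name, check your runtime
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


print(yarn_rope_scaling(131072))
```

For the full 131,072-token window the factor works out to 4.0, matching the model card figure quoted above.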

Chat template

ChatML (Qwen3 variant)

Each turn is rendered as <|im_start|>{role}\n{content}<|im_end|>. The template ships in tokenizer_config.json; apply it via the runtime rather than hand-rolling, since the thinking-mode toggle inserts an extra system marker.
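For debugging template mismatches it can help to see what the basic ChatML layout looks like. The sketch below is for illustration only: it renders plain turns and deliberately omits the thinking-mode handling, so it is not a substitute for the shipped template.

```python
def render_chatml(messages, add_generation_prompt: bool = True) -> str:
    """Render messages in plain ChatML; thinking-mode markers are NOT handled."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are Qwen, a helpful assistant."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

Comparing this output against what your runtime actually sends is a quick way to spot a wrong or missing chat template.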

Tool calling

✓ Supported (Hermes-style)

Per the model card, Qwen3 uses Hermes-style tool call format: tools declared in the system prompt, calls emitted as <tool_call>{...}</tool_call> blocks. Compatible with llama.cpp's --jinja mode and most agent frameworks.
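If your framework does not parse these blocks for you, extracting them by hand is straightforward. A minimal sketch, assuming each block wraps a single JSON object; the `name` and `arguments` keys in the example follow the usual Hermes convention and the `get_weather` tool is hypothetical.

```python
import json
import re

# Lazy match that still captures nested braces up to the closing tag.
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return every tool call in a model reply as a parsed dict."""
    return [json.loads(payload) for payload in TOOL_CALL.findall(text)]


reply = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
for call in extract_tool_calls(reply):
    print(call["name"], call["arguments"])
```

Malformed JSON inside a block raises json.JSONDecodeError here; an agent loop would normally catch that and ask the model to retry.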

Sampler settings

temperature: 0.7
top_p: 0.8
top_k: 20

Vendor-recommended defaults from the Qwen3 model card. For /think mode, the card recommends temperature 0.6 and top_p 0.95 instead — switch sampler when reasoning.
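Because the sampler should change with the mode, it is convenient to pick both together per query. A small sketch using the values above; it assumes top_k stays at 20 in thinking mode, which the text here does not state explicitly, and the helper name is hypothetical.

```python
NON_THINKING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}
THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}  # top_k assumed unchanged


def prepare_query(prompt: str, thinking: bool):
    """Append the soft switch and return (prompt, sampler settings) for the mode."""
    switch = "/think" if thinking else "/no_think"
    sampler = THINKING if thinking else NON_THINKING
    return f"{prompt} {switch}", dict(sampler)


text, sampler = prepare_query("Prove that the sum of two even numbers is even.", thinking=True)
print(text)
print(sampler)
```

Returning a copy of the sampler dict lets callers adjust individual values, such as a per-query token budget, without mutating the shared defaults.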
