qwen
32B parameters
Commercial OK
Reviewed June 2026

Qwen 3 32B

Dense Qwen 3 32B. Best dense open-weight model in its size class at release; pairs nicely with a single RTX 5090 or 4090.

License: Apache 2.0·Released Apr 29, 2025·Context: 131,072 tokens

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
8.9/10

Positioning

The new daily driver for RTX 3090 / 4090 / 5090 owners. Same VRAM footprint as Qwen 2.5 32B, materially better on reasoning thanks to thinking mode, similar speed in non-thinking. The right answer to "what runs on my 24 GB GPU?" today.

Strengths

  • 19 GB at Q4_K_M — full GPU offload on 24 GB with 16K context.
  • Hybrid reasoning lifts hard-task quality past Qwen 2.5 32B without VRAM cost.
  • Multilingual carryover still strong.

Limitations

  • Thinking-mode tokens cost real time — verbose intermediate reasoning eats throughput.
  • License caps as before.
  • Qwen 2.5 Coder 32B still beats it for coding — coder is a dedicated specialist.

Real-world performance on RTX 4090

  • Q4_K_M (19.4 GB): 68–86 tok/s decode in non-thinking mode; thinking mode decodes at the same rate but emits more tokens per answer
  • Q5_K_M (22.9 GB): 56–70 tok/s
  • Q8_0 (35 GB): partial offload, 18–24 tok/s

Should you run this locally?

Yes, for 24 GB single-card owners who want the strongest dense model with hybrid reasoning. The new default daily driver. No, for dedicated coding workflows (pick Qwen 2.5 Coder 32B), or hard reasoning where QwQ 32B's specialization wins.

How it compares

  • vs Qwen 2.5 32B Instruct → Qwen 3 32B wins outright at the same VRAM. New work should default to Qwen 3.
  • vs QwQ 32B → QwQ is the reasoning specialist; Qwen 3 32B is the generalist with optional reasoning. Pick QwQ for math/code reasoning, Qwen 3 32B for general chat.
  • vs Llama 3.3 70B → Llama 3.3 70B is smarter but 3× slower on the same hardware. Qwen 3 32B is the productivity pick.
  • vs Qwen 3 30B-A3B (MoE) → 30B-A3B is faster (~2× tok/s) due to MoE; Qwen 3 32B dense is steadier on instruction following.

Run this yourself

ollama pull qwen3:32b
ollama run qwen3:32b
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090
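
If you drive the model through Ollama's HTTP API instead of the CLI, the 16384-token context from the settings above has to be passed explicitly. A minimal sketch of the request body, assuming Ollama's `/api/chat` endpoint and its `num_ctx` option; the helper name `build_ollama_chat_request` is ours, not part of Ollama:

```python
import json


def build_ollama_chat_request(prompt: str, num_ctx: int = 16384) -> dict:
    """Build a JSON body for Ollama's /api/chat endpoint (hypothetical helper)."""
    return {
        "model": "qwen3:32b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
        "options": {"num_ctx": num_ctx},  # context length used for this request
    }


body = build_ollama_chat_request("Summarize YaRN in one sentence.")
print(json.dumps(body, indent=2))
```

POST the printed body to your Ollama server (by default it listens on localhost port 11434). Whether 16K context still fits fully on the GPU depends on your card and driver overhead, so check before relying on it.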

Why this rating

8.9/10 — the 32B-class evolution of the Qwen 3 thinking-mode story. Stronger absolute capability than Qwen 2.5 32B, runs in the same VRAM. Replaces 2.5 32B as the default for 24 GB single-card daily-driver use.

Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Workstation tier·Role: General model with reasoning toggle
    Build a local reasoning-model stack (May 2026)

    Qwen 3 32B has a reasoning-mode toggle (the <think> block convention) that you can enable per-query. Useful when most of your workload doesn't need reasoning — fall back to standard mode for chat, enable thinking for math / code / analysis.

Execution notes

Operator notes

Qwen 3 32B is the reasoning-toggle generation of the Qwen family. The architectural shift from Qwen 2.5: native `<think>` reasoning blocks that toggle per-query. Strong reasoning when enabled (roughly comparable to DeepSeek R1 Distill Qwen 32B); fast chat when disabled (no reasoning-token tax). Apache 2.0.

The right pick when your workload mix is mostly chat with occasional reasoning needs — you don't pay the reasoning-block cost on simple queries.

Deployment notes

Production: vLLM + RTX 4090 24 GB + AWQ-INT4 quant + 32K context. Set `gpu-memory-utilization` to 0.85 (not the default 0.9) — reasoning-block emission pushes KV cache pressure higher than non-reasoning 32B models. The /stacks/local-reasoning-model recipe pairs this configuration with Open WebUI's reasoning-block rendering.
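
The production settings above can be written down as a launch command. This is a sketch, not a tested configuration: the flags are standard `vllm serve` options, but the repository name `Qwen/Qwen3-32B-AWQ` is an assumption (substitute whichever AWQ build you actually use), and the helper `vllm_serve_command` is purely illustrative.

```python
import shlex


def vllm_serve_command(
    model: str = "Qwen/Qwen3-32B-AWQ",  # assumed repo name, replace with your AWQ build
    max_model_len: int = 32768,  # 32K context, as in the deployment note
    gpu_memory_utilization: float = 0.85,  # below vLLM's 0.9 default
) -> list[str]:
    """Assemble the argument list for a `vllm serve` launch (hypothetical helper)."""
    return [
        "vllm", "serve", model,
        "--quantization", "awq",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-chunked-prefill",
    ]


print(shlex.join(vllm_serve_command()))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` or to lower `max_model_len` per deployment without string editing.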

Workstation: 5090 32 GB or M4 Max 64 GB unified memory both fit comfortably with full headroom for concurrent users.

Multi-user: SGLang over vLLM if reasoning-mode is the dominant workload — RadixAttention's prefix-cache wins compound across reasoning queries with shared system prompts.

Runtime compatibility

  • vLLM ✓ excellent. AWQ-INT4 supported; treat --enable-chunked-prefill as required for reasoning queries.
  • SGLang ✓ excellent. RadixAttention pairs naturally with reasoning workloads.
  • Ollama ✓ good. Q4_K_M GGUF available; loses concurrency benefits.
  • MLX-LM ✓ good. Apple Silicon path; 32B in MLX-4bit fits 64 GB unified memory.

Quantization suitability

AWQ-INT4 is the production-recommended quant. KV cache pressure with reasoning-mode is higher than non-reasoning 32B — drop `gpu-memory-utilization` to 0.85 to leave headroom for reasoning-block emissions. Q4_K_M GGUF for the Ollama path; same caveats apply.
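
To see why reasoning tokens matter for headroom, it helps to estimate the KV cache per token. The sketch below assumes an fp16 cache and the architecture values we believe Qwen3-32B ships with (64 layers, 8 KV heads, head dimension 128); these numbers are assumptions, so confirm them against the model's config.json before trusting the output.

```python
def kv_cache_gib(
    tokens: int,
    layers: int = 64,  # assumed, check config.json
    kv_heads: int = 8,  # assumed (grouped-query attention)
    head_dim: int = 128,  # assumed
    bytes_per_value: int = 2,  # fp16 / bf16 cache
) -> float:
    """Estimate KV cache size in GiB for a single sequence."""
    # Factor 2 covers keys and values, stored for every layer and KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1024**3


for n in (16_384, 32_768, 32_768 + 2_000):
    print(f"{n:>6} tokens -> {kv_cache_gib(n):.2f} GiB")
```

Under these assumptions each token costs 256 KiB of cache, so a 2000-token reasoning block adds roughly half a GiB on top of whatever the prompt already holds. Add that to the quantized weights to judge whether a 24 GB card still has room.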

When to use a different model

  • Coding-first: Qwen 2.5 Coder 32B, the Qwen family's dedicated coding specialist.
  • Pure reasoning (no toggle needed): DeepSeek R1 Distill Qwen 32B — always-on reasoning, slightly stronger on math benchmarks.
  • 16 GB VRAM tier: Qwen 3 14B — same family, reasoning toggle at smaller scale.
  • Frontier-tier: cluster-deploy DeepSeek V4 — May 2026 open-weight benchmark leader.

Best use cases

  • Mixed chat + reasoning workloads — toggle provides the right operating point for each query.
  • Agent loops with selective reasoning — invoke reasoning-mode for plan-generation steps; standard mode for tool-call iterations.
  • Multilingual workflows — Qwen family's CJK depth carries through; better than Llama on Chinese / Japanese.
  • Apache 2.0 license required — drops in cleanly for commercial deployments without license review.

Failure modes

  1. Reasoning-block emission inside structured output. If the model is instructed to emit JSON but reasoning-mode is enabled, the thinking block can leak into the JSON output. Disable reasoning-mode for structured-output workflows.
  2. Token-cost runaway on reasoning chains. Reasoning blocks can emit 2000+ tokens; multiply by the agent loop's tool-call count and the cost compounds. Set per-query token budgets.
  3. KV cache OOM on long reasoning + long context. 32K context + 2000-token reasoning + system prompt + tool schemas = ~25 GB of combined weight and KV cache pressure on a 24 GB card. Lower max_model_len or switch off reasoning for long-context tasks.
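
For the first failure mode, a defensive parser is cheap insurance even when reasoning is supposed to be off. A minimal sketch, assuming the thinking block arrives wrapped in `<think>...</think>` tags ahead of the JSON; the helper name `parse_json_reply` is ours:

```python
import json
import re

# Matches a complete thinking block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_json_reply(raw: str):
    """Drop any <think> blocks, then parse what remains as JSON."""
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)  # raises json.JSONDecodeError if still not JSON


reply = (
    "<think>\nThe user wants a status object.\n</think>\n\n"
    '{"status": "ok", "items": 2}'
)
print(parse_json_reply(reply))
```

This only handles well-formed, closed blocks. A reply truncated mid-thought by a token budget will still fail to parse, which is the right signal to retry with reasoning disabled.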

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Distilled / fine-tuned from this

Strengths

  • Strongest dense ~30B model
  • Apache 2.0
  • Tool calling

Weaknesses

  • Needs 24GB+ VRAM

Prompting kit

From model card

Tested patterns for getting the most out of Qwen 3 32B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are Qwen, a helpful assistant created by Alibaba Cloud. Answer the user's question directly and concisely. When the task requires step-by-step analysis, work through it carefully before giving the final answer.

Quirks to know

  • Supports a 'thinking mode' switch — append /think to enable visible chain-of-thought, /no_think to disable. Per the model card, /no_think is recommended for short Q&A; /think is recommended for math, code, and multi-step reasoning.
  • Native 32K context window. Per the model card, contexts up to 131K are reachable with YaRN scaling — set rope_scaling factor to 4.0 in your runtime config.
  • Hybrid reasoning: the same checkpoint handles both fast chat and deep reasoning depending on the /think toggle. No separate model required.
  • Uses ChatML format with <|im_start|> / <|im_end|> role tokens — confirm your runtime's chat template matches the one shipped in tokenizer_config.json.
  • Multilingual: officially supports 119 languages per the model card. Quality stays high in CJK languages; African and lower-resource languages may degrade.
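
The YaRN factor quoted above is just the ratio of target to native context. The sketch below works that out and emits a `rope_scaling` entry; the key names follow the Hugging Face config convention as we understand it from the model card, so treat them as an assumption and check your runtime's documentation for its own spelling.

```python
NATIVE_CONTEXT = 32_768  # native window stated in the quirks list


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Return a rope_scaling entry that stretches the native window with YaRN."""
    if target_context <= native_context:
        raise ValueError("YaRN is only needed beyond the native context window")
    return {
        "rope_type": "yarn",  # assumed key name
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,  # assumed key name
    }


print(yarn_rope_scaling(131_072))
```

For 131,072 tokens the factor comes out at 4.0, matching the figure in the list. A smaller target such as 65,536 gives 2.0, which is worth using when you do not need the full window.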

Chat template

ChatML (Qwen3 variant)

<|im_start|>{role}\n{content}<|im_end|>. The template ships in tokenizer_config.json — apply it via the runtime rather than hand-rolling, since the thinking-mode toggle inserts an extra system marker.
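
For eyeballing what your runtime sends, it can help to render the basic ChatML shape by hand and compare. This is a deliberately simplified sketch: it covers only the role and content wrapping described above and does not reproduce the thinking-mode or tool handling in the shipped template, so use it for debugging, not for inference.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages in plain ChatML (simplified; not the full Qwen3 template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are Qwen, a helpful assistant."},
    {"role": "user", "content": "Name one CJK language."},
]))
```

If the prompt your runtime logs does not start with the same `<|im_start|>system` wrapping, the chat template is probably mismatched.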

Tool calling

✓ Supported (Hermes-style)

Per the model card, Qwen3 uses Hermes-style tool call format: tools declared in the system prompt, calls emitted as <tool_call>{...}</tool_call> blocks. Compatible with llama.cpp's --jinja mode and most agent frameworks.
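
If your framework does not parse tool calls for you, pulling them out of the raw text is straightforward. A minimal sketch, assuming each block holds one JSON object with `name` and `arguments` keys (the usual Hermes shape); the `get_weather` tool is hypothetical:

```python
import json
import re

# One JSON object per <tool_call> block; DOTALL lets the object span lines.
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every tool call found in a model reply, in order."""
    return [json.loads(payload) for payload in TOOL_CALL.findall(text)]


reply = (
    "I'll check that for you.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Lagos"}}\n'
    "</tool_call>"
)
for call in extract_tool_calls(reply):
    print(call["name"], call["arguments"])
```

Malformed JSON inside a block raises `json.JSONDecodeError`; in an agent loop that is usually worth catching and feeding back to the model as an error message.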

Sampler settings

temperature: 0.7
top_p: 0.8
top_k: 20

Vendor-recommended defaults from the Qwen3 model card. For /think mode, the card recommends temperature 0.6 and top_p 0.95 instead — switch sampler when reasoning.
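
Since the sampler should change with the mode, it is convenient to switch both together. A small sketch using the values quoted above; keeping top_k at 20 in thinking mode is our assumption (the text only mentions temperature and top_p changing), and `prepare_request` is a hypothetical helper:

```python
SAMPLERS = {
    "no_think": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    # top_k carried over unchanged: an assumption, not stated in the text above.
    "think": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
}


def prepare_request(prompt: str, think: bool) -> tuple[str, dict]:
    """Append the mode switch to the prompt and pick the matching sampler."""
    mode = "think" if think else "no_think"
    return f"{prompt} /{mode}", dict(SAMPLERS[mode])


text, sampler = prepare_request("Prove that 17 is prime.", think=True)
print(text)
print(sampler)
```

Returning a copy of the sampler dict keeps per-request tweaks from leaking into the shared defaults.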

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M       | 19.0 GB   | 24 GB
Q5_K_M       | 22.0 GB   | 28 GB
Q8_0         | 34.0 GB   | 40 GB
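
The table turns into a simple lookup: pick the largest quantization whose VRAM requirement your card meets. A sketch using only the three rows above; the function name `best_quant` is ours:

```python
# (name, file size in GB, VRAM required in GB), copied from the table above.
QUANTS = [
    ("Q4_K_M", 19.0, 24),
    ("Q5_K_M", 22.0, 28),
    ("Q8_0", 34.0, 40),
]


def best_quant(vram_gb: float):
    """Return the largest listed quantization that fits, or None if none do."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]


for vram in (16, 24, 32, 48):
    print(f"{vram} GB -> {best_quant(vram)}")
```

The VRAM column already includes headroom beyond file size, so this does not add a further margin. Long contexts or thinking mode may still push you one step down.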

Get the model

Ollama

One-line install

ollama run qwen3:32b

HuggingFace

Original weights

huggingface.co/Qwen/Qwen3-32B

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Qwen 3 32B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Qwen 3 32B?

24GB of VRAM is enough to run Qwen 3 32B at the Q4_K_M quantization (file size 19.0 GB). Higher-quality quantizations need more.

Can I use Qwen 3 32B commercially?

Yes. Qwen 3 32B ships under the Apache 2.0 license, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 3 32B?

Qwen 3 32B supports a context window of up to 131,072 tokens (about 131K). Per the model card, the native window is 32K; reaching the full 131K requires YaRN scaling.

How do I install Qwen 3 32B with Ollama?

Run `ollama pull qwen3:32b` to download, then `ollama run qwen3:32b` to start a chat session. The default quantization is Q4_K_M.

Compare against other models

Curated head-to-head decisions where Qwen 3 32B is one of the contenders. For arbitrary pairings use /model-battle.

Source: huggingface.co/Qwen/Qwen3-32B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
