Qwen 3 32B

Positioning

The new daily driver for RTX 3090 / 4090 / 5080 owners. Same VRAM footprint as Qwen 2.5 32B, materially better on reasoning thanks to thinking mode, similar speed in non-thinking. The right answer to "what runs on my 24 GB GPU?" today.

Strengths

19 GB at Q4_K_M — full GPU offload on 24 GB with 16K context.
Hybrid reasoning lifts hard-task quality past Qwen 2.5 32B without VRAM cost.
Multilingual carryover still strong.

Limitations

Thinking-mode tokens cost real time — verbose intermediate reasoning eats throughput.
License caps as before.
Qwen 2.5 Coder 32B still beats it for coding — coder is a dedicated specialist.

Real-world performance on RTX 4090

Q4_K_M (19.4 GB): 68–86 tok/s decode (non-thinking); same speed thinking, more tokens emitted
Q5_K_M (22.9 GB): 56–70 tok/s
Q8_0 (35 GB): partial offload, 18–24 tok/s

Should you run this locally?

Yes, for 24 GB single-card owners who want the strongest dense model with hybrid reasoning. The new default daily driver. No, for dedicated coding workflows (pick Qwen 2.5 Coder 32B), or hard reasoning where QwQ 32B's specialization wins.

How it compares

vs Qwen 2.5 32B Instruct → Qwen 3 32B wins outright at the same VRAM. New work should default to Qwen 3.
vs QwQ 32B → QwQ is the reasoning specialist; Qwen 3 32B is the generalist with optional reasoning. Pick QwQ for math/code reasoning, Qwen 3 32B for general chat.
vs Llama 3.3 70B → Llama 3.3 70B is smarter but 3× slower on the same hardware. Qwen 3 32B is the productivity pick.
vs Qwen 3 30B-A3B (MoE) → 30B-A3B is faster (~2× tok/s) due to MoE; Qwen 3 32B dense is steadier on instruction following.

Run this yourself

ollama pull qwen3:32b
ollama run qwen3:32b

Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090

Quantization	File size	VRAM required
Q4_K_M	19.0 GB	24 GB
Q5_K_M	22.0 GB	28 GB
Q8_0	34.0 GB	40 GB

Quantization

File size

VRAM required

Q4_K_M

19.0 GB

24 GB

Q5_K_M

22.0 GB

28 GB

Q8_0

34.0 GB

40 GB

Frequently asked

What's the minimum VRAM to run Qwen 3 32B?

24GB of VRAM is enough to run Qwen 3 32B at the Q4_K_M quantization (file size 19.0 GB). Higher-quality quantizations need more.

Can I use Qwen 3 32B commercially?

Yes — Qwen 3 32B ships under the Apache 2.0, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 3 32B?

Qwen 3 32B supports a context window of 131,072 tokens (about 131K).

How do I install Qwen 3 32B with Ollama?

Run `ollama pull qwen3:32b` to download, then `ollama run qwen3:32b` to start a chat session. The default quantization is Q4_K_M.

Overview

Strengths

Weaknesses

Quantization variants

Get the model

Ollama

HuggingFace

Hardware that runs this

Models worth comparing

Frequently asked

What's the minimum VRAM to run Qwen 3 32B?

Can I use Qwen 3 32B commercially?

What's the context length of Qwen 3 32B?

How do I install Qwen 3 32B with Ollama?