
Qwen 3 32B

The dense 32B model of the Qwen 3 family. Best dense open-weight model in its size class at release; pairs nicely with a single RTX 5090 or 4090.

License: Apache 2.0 · Released Apr 29, 2025 · Context: 131,072 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
8.9/10
Positioning

The new daily driver for RTX 3090 / 4090 / 5090 owners. Same VRAM footprint as Qwen 2.5 32B, materially better on reasoning thanks to thinking mode, and similar speed in non-thinking mode. The right answer to "what runs on my 24 GB GPU?" today.
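
Qwen 3 documents /think and /no_think soft switches for toggling thinking mode per prompt; a minimal sketch via the Ollama CLI, assuming the qwen3:32b chat template honors them (behavior can vary by runtime version):
# thinking on (the default): reasoning tokens precede the answer
ollama run qwen3:32b "Plan a migration from Qwen 2.5 32B. /think"
# thinking off: faster, direct answer
ollama run qwen3:32b "Plan a migration from Qwen 2.5 32B. /no_think"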

Strengths
  • 19 GB at Q4_K_M — full GPU offload on 24 GB with 16K context.
  • Hybrid reasoning lifts hard-task quality past Qwen 2.5 32B without VRAM cost.
  • Multilingual carryover still strong.
Limitations
  • Thinking-mode tokens cost real time — verbose intermediate reasoning eats throughput.
  • License terms unchanged from Qwen 2.5 32B: still plain Apache 2.0, nothing new either way.
  • Qwen 2.5 Coder 32B still beats it for coding — coder is a dedicated specialist.
Real-world performance on RTX 4090
  • Q4_K_M (19.4 GB): 68–86 tok/s decode (non-thinking); same speed thinking, more tokens emitted
  • Q5_K_M (22.9 GB): 56–70 tok/s
  • Q8_0 (35 GB): partial offload, 18–24 tok/s
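
To sanity-check these numbers on your own card, ollama run accepts a --verbose flag that prints timing stats after each reply, including decode speed; labels can vary slightly across Ollama versions:
ollama run qwen3:32b --verbose "Write one sentence about VRAM."
# look for the "eval rate" line (tokens per second) in the printed stats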
Should you run this locally?

Yes, for 24 GB single-card owners who want the strongest dense model with hybrid reasoning. The new default daily driver. No, for dedicated coding workflows (pick Qwen 2.5 Coder 32B), or hard reasoning where QwQ 32B's specialization wins.

How it compares
  • vs Qwen 2.5 32B Instruct → Qwen 3 32B wins outright at the same VRAM. New work should default to Qwen 3.
  • vs QwQ 32B → QwQ is the reasoning specialist; Qwen 3 32B is the generalist with optional reasoning. Pick QwQ for math/code reasoning, Qwen 3 32B for general chat.
  • vs Llama 3.3 70B → Llama 3.3 70B is smarter but 3× slower on the same hardware. Qwen 3 32B is the productivity pick.
  • vs Qwen 3 30B-A3B (MoE) → 30B-A3B is faster (~2× tok/s) due to MoE; Qwen 3 32B dense is steadier on instruction following.
Run this yourself
ollama pull qwen3:32b
ollama run qwen3:32b
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090
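
If you run llama.cpp instead of Ollama, the same settings translate roughly to the invocation below (hypothetical GGUF filename; -ngl 99 offloads all layers to the GPU, -c sets the 16K context):
llama-server -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -c 16384
# then point any OpenAI-compatible client at http://localhost:8080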
Why this rating

8.9/10 — the 32B-class evolution of the Qwen 3 thinking-mode story. Stronger absolute capability than Qwen 2.5 32B, runs in the same VRAM. Replaces 2.5 32B as the default for 24 GB single-card daily-driver use.

Strengths

  • Strongest dense ~30B model
  • Apache 2.0
  • Tool calling

Weaknesses

  • Needs 24 GB+ VRAM

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point; a pull example follows the table.

Quantization    File size    VRAM required
Q4_K_M          19.0 GB      24 GB
Q5_K_M          22.0 GB      28 GB
Q8_0            34.0 GB      40 GB
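
Ollama usually exposes these variants as size-plus-quant tags; assuming the registry follows its usual naming (check the model page for the exact tags), pulling a specific quantization looks like:
ollama pull qwen3:32b-q8_0   # hypothetical tag; plain qwen3:32b resolves to Q4_K_M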

Get the model

Ollama

One-line install

ollama run qwen3:32b

Read our Ollama review →

HuggingFace

Original weights

huggingface.co/Qwen/Qwen3-32B

Source repository (original safetensors) — you'll need to quantize it yourself for local use; see the sketch below.
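
A typical conversion path uses llama.cpp's converter and quantizer; a sketch assuming a current llama.cpp checkout with its tools on your PATH:
huggingface-cli download Qwen/Qwen3-32B --local-dir Qwen3-32B
python convert_hf_to_gguf.py Qwen3-32B --outtype f16 --outfile qwen3-32b-f16.gguf
llama-quantize qwen3-32b-f16.gguf qwen3-32b-q4_k_m.gguf Q4_K_M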

Hardware that runs this

Cards with enough VRAM for at least one quantization of Qwen 3 32B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Qwen 3 32B?

24 GB of VRAM is enough to run Qwen 3 32B at the Q4_K_M quantization (file size 19.0 GB). Higher-quality quantizations need more.
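
On NVIDIA cards you can confirm headroom before downloading 19 GB of weights; nvidia-smi ships with the driver:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv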

Can I use Qwen 3 32B commercially?

Yes — Qwen 3 32B ships under the Apache 2.0 license, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 3 32B?

Qwen 3 32B supports a context window of 131,072 tokens (128K).
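
Note that Ollama runs models with a much smaller default working context, so long-context use means raising num_ctx explicitly; a sketch against Ollama's local API (the endpoint and options field are standard, the num_ctx value is your choice):
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:32b",
  "prompt": "Summarize this transcript...",
  "options": { "num_ctx": 32768 }
}'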

How do I install Qwen 3 32B with Ollama?

Run `ollama pull qwen3:32b` to download, then `ollama run qwen3:32b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/Qwen/Qwen3-32B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.