Qwen 3 30B-A3B

Positioning

Qwen 3 30B-A3B is the most operator-relevant Qwen MoE released so far. Where Qwen 3.5 235B-A17B and Qwen 3 235B-A22B need 128-GB+ unified memory or workstation-tier hardware, Qwen 3 30B-A3B fits comfortably on a single 24-GB consumer GPU at Q4 — meaning it runs on an RTX 3090 (used $700-1000), RTX 4090 ($1,400-1,900 used), RX 7900 XTX ($700-900), or any 24+ GB Mac. 30B total params with ~3B active per token: decode tok/s is closer to a 3B dense model than a 30B dense one, while quality lands meaningfully above 7B-class. The operator-grade pitch: this is the Qwen frontier you can actually run on the GPU you already own.

Strengths

Fits 24 GB single-card hardware at Q4 — ~16-19 GB at Q4, leaves comfortable headroom for 32K context. Single RTX 3090 / 4090 / 7900 XTX runs it natively without partial offload.
MoE efficiency is the architecture's killer feature. ~3B active params per token means decode speed approaches 7B-dense models, despite the 30B parameter count. Real-world: ~60-100 tok/s on consumer 24-GB cards.
Apache 2.0 license — same permissive Qwen-team license as the bigger 235B variants. Commercial use unrestricted, no MAU clauses.
Excellent multilingual + English combo. Same Qwen-strength on Chinese + 60+ languages. Outperforms most 7B-13B models on multilingual tasks while needing barely more memory.
Day-zero tooling support. vLLM, SGLang, llama.cpp, Ollama all shipped Q3 30B-A3B compatibility within hours of release. Less tooling lag than the 235B family.
Strong reasoning + coding combo for the active-parameter count. Doesn't beat 32B dense models like Qwen 2.5 Coder 32B on coding-specific benchmarks, but for general daily-driver work it's competitive at lower hardware cost.

Limitations

Quality ceiling is below the 235B siblings. Q3 30B-A3B is "frontier-adjacent" — comfortably better than 7B-class, meaningfully behind the 235B-class. The right framing: this is the best you can run on a 24-GB card, not the best Qwen has.
MoE expert routing isn't perfect. Some prompts activate suboptimal expert combinations, producing outputs that feel "off" compared to dense 32B models. Less of an issue with mature vLLM routing; more visible on llama.cpp implementations.
3B active parameters means it can be more brittle on edge cases vs a 32B dense model. For production pipelines that need predictable behavior, dense 32B (Qwen 2.5 32B Instruct, Qwen 2.5 Coder 32B) might be the more reliable pick.
Effective context is ~32K despite the spec advertising more. Quality drops past 32K in our internal testing; not a 128K-effective model.

Real-world performance on RTX 4090 (24 GB)

Q4_K_M (~17 GB): ~80-110 tok/s decode, TTFT ~80-150 ms on 1K prompts. The headline daily-driver workload.
Q5_K_M (~20 GB): ~65-90 tok/s, slightly better quality, less context room.
Q8_0 (~30 GB partial-offload): ~25-40 tok/s. Quality bump over Q4 is small; rarely worth the speed loss.
Compare with: Qwen 2.5 32B Instruct at Q4 on same hardware: ~35-50 tok/s. Q3 30B-A3B's MoE efficiency wins by 2-3× on raw speed.

Should you run this locally?

Yes, for anyone with a 24-GB single GPU who wants frontier-adjacent quality at consumer hardware tier. This is the Qwen-MoE-on-a-4090 daily driver. The right pick for general assistant work, RAG pipelines, agent loops, and most coding tasks. If you have the hardware, run this.

Yes, for Mac Studio M-series operators who want a fast Qwen variant for daily use without committing to the 235B-tier hardware footprint. Q4 fits any M-class with 32 GB+ unified memory.

No, for anyone running a sub-16-GB card. Q4 needs ~17 GB; partial-offload doesn't make sense for a model that's already MoE-efficient. Use Qwen 3 8B or Qwen 2.5 7B instead at smaller-card tiers.

Probably not, for anyone whose primary workload is coding (Qwen 2.5 Coder 32B at Q4 fits 24 GB and outperforms on coding-specific benchmarks at the cost of slower decode).

Probably not, for anyone whose primary need is multilingual (where Q3 30B-A3B beats most alternatives but the dense-32B Qwen variants beat it slightly).

How it compares

vs Qwen 3.5 235B-A17B (frontier) → 235B has higher quality ceiling but needs 128-GB+ hardware. 30B-A3B fits a 24-GB consumer card. Pick 30B-A3B for accessibility; pick 235B-A17B if you have the hardware AND need the extra quality. Different operator tiers.
vs Qwen 3 235B-A22B (prior-gen frontier) → same hardware contrast as 3.5. The 30B-A3B is the consumer-tier answer to either 235B variant.
vs Qwen 2.5 32B Instruct (dense, same param count, same VRAM) → 32B-Instruct is dense (32B compute per token vs 3B for the MoE). 32B-Instruct edges quality on most benchmarks; 30B-A3B is ~2-3× faster. Pick MoE for daily-driver speed; dense for production reliability.
vs Qwen 2.5 Coder 32B (coding specialist) → Coder 32B beats Q3 30B-A3B on coding tasks. Pick Coder 32B if you're using Aider or Continue for serious code work; pick Q3 30B-A3B for general assistant + light coding.
vs DeepSeek R1 Distill family (reasoning specialists) → R1 Distill 32B specializes in reasoning chain-of-thought. Q3 30B-A3B is generalist. Pick R1 Distill for math + logic puzzles; pick 30B-A3B for daily mixed-use.
vs Llama 4 Scout → Scout has 128k effective context vs 30B-A3B's 32k. Llama license has 700M MAU clause; Apache 2.0 wins for most teams. Pick Scout for long-context; pick 30B-A3B for license simplicity.

Run this yourself

# RTX 4090 / 3090 / 7900 XTX — single-card 24 GB
ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b

# For better runtime control via llama.cpp:
llama-server -m qwen3-30b-a3b-Q4_K_M.gguf \
  --ctx-size 32768 -ngl 999 --temp 0.7

# For multi-user serving via vLLM (production-tier):
vllm serve Qwen/Qwen3-30B-A3B-Instruct \
  --tensor-parallel-size 1 --max-model-len 32768

Quant: Q4_K_M GGUF Context: 32768 (KV cache f16, ~2 GB additional) Backend: llama.cpp via Ollama, CUDA 12.x Hardware: RTX 4090, NVIDIA driver 555+

Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3·Production tier·Role: MoE workload (3B active, 30B total)

Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink

30B/3B-active MoE on dual-4090 PCIe is the throughput sweet spot. Expert routing across cards is bandwidth-friendlier than tensor parallelism for dense models, so the no-NVLink penalty is smaller. ~80 tok/s decode at 8 concurrent.

You are Qwen, a helpful assistant created by Alibaba Cloud. Answer the user's question directly and concisely. When the task requires step-by-step analysis, work through it carefully before giving the final answer.

Quantization	File size	VRAM required
Q4_K_M	18.0 GB	22 GB
Q8_0	32.0 GB	36 GB

Quantization

File size

VRAM required

Q4_K_M

18.0 GB

22 GB

Q8_0

32.0 GB

36 GB

Frequently asked

What's the minimum VRAM to run Qwen 3 30B-A3B?

22GB of VRAM is enough to run Qwen 3 30B-A3B at the Q4_K_M quantization (file size 18.0 GB). Higher-quality quantizations need more.

Can I use Qwen 3 30B-A3B commercially?

Yes — Qwen 3 30B-A3B ships under the Apache 2.0, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 3 30B-A3B?

Qwen 3 30B-A3B supports a context window of 131,072 tokens (about 131K).

How do I install Qwen 3 30B-A3B with Ollama?

Run `ollama pull qwen3:30b` to download, then `ollama run qwen3:30b` to start a chat session. The default quantization is Q4_K_M.

Our verdict

Positioning

Strengths

Limitations

Real-world performance on RTX 4090 (24 GB)

Should you run this locally?

How it compares

Run this yourself

Overview

Featured in this stack

Family & lineage

Strengths

Weaknesses

Prompting kit

Recommended system prompt

Quirks to know

Chat template

Tool calling

Sampler settings

Quantization variants

Get the model

Ollama

HuggingFace

Hardware that runs this

Models worth comparing

Frequently asked

What's the minimum VRAM to run Qwen 3 30B-A3B?

Can I use Qwen 3 30B-A3B commercially?

What's the context length of Qwen 3 30B-A3B?

How do I install Qwen 3 30B-A3B with Ollama?

Compare against other models

Related — keep moving