Qwen 3 30B-A3B
Mid-tier Qwen 3 MoE. 30B total / 3B active means 70B-class quality at 7B-class inference speed on a single 24GB card. The sweet spot of the Qwen 3 lineup for prosumer hardware.
Positioning
Qwen 3 30B-A3B is the most operator-relevant Qwen MoE released so far. Where Qwen 3.5 235B-A17B and Qwen 3 235B-A22B need 128-GB+ unified memory or workstation-tier hardware, Qwen 3 30B-A3B fits comfortably on a single 24-GB consumer GPU at Q4 — meaning it runs on an RTX 3090 (used $700-1000), RTX 4090 ($1,400-1,900 used), RX 7900 XTX ($700-900), or any 24+ GB Mac. 30B total params with ~3B active per token: decode tok/s is closer to a 3B dense model than a 30B dense one, while quality lands meaningfully above 7B-class. The operator-grade pitch: this is the Qwen frontier you can actually run on the GPU you already own.
Strengths
- Fits 24 GB single-card hardware at Q4 — ~16-19 GB at Q4, leaves comfortable headroom for 32K context. Single RTX 3090 / 4090 / 7900 XTX runs it natively without partial offload.
- MoE efficiency is the architecture's killer feature. ~3B active params per token means decode speed approaches 7B-dense models, despite the 30B parameter count. Real-world: ~60-100 tok/s on consumer 24-GB cards.
- Apache 2.0 license — same permissive Qwen-team license as the bigger 235B variants. Commercial use unrestricted, no MAU clauses.
- Excellent multilingual + English combo. Same Qwen-strength on Chinese + 60+ languages. Outperforms most 7B-13B models on multilingual tasks while needing barely more memory.
- Day-zero tooling support. vLLM, SGLang, llama.cpp, Ollama all shipped Q3 30B-A3B compatibility within hours of release. Less tooling lag than the 235B family.
- Strong reasoning + coding combo for the active-parameter count. Doesn't beat 32B dense models like Qwen 2.5 Coder 32B on coding-specific benchmarks, but for general daily-driver work it's competitive at lower hardware cost.
Limitations
- Quality ceiling is below the 235B siblings. Q3 30B-A3B is "frontier-adjacent" — comfortably better than 7B-class, meaningfully behind the 235B-class. The right framing: this is the best you can run on a 24-GB card, not the best Qwen has.
- MoE expert routing isn't perfect. Some prompts activate suboptimal expert combinations, producing outputs that feel "off" compared to dense 32B models. Less of an issue with mature vLLM routing; more visible on llama.cpp implementations.
- 3B active parameters means it can be more brittle on edge cases vs a 32B dense model. For production pipelines that need predictable behavior, dense 32B (Qwen 2.5 32B Instruct, Qwen 2.5 Coder 32B) might be the more reliable pick.
- Effective context is ~32K despite the spec advertising more. Quality drops past 32K in our internal testing; not a 128K-effective model.
Real-world performance on RTX 4090 (24 GB)
- Q4_K_M (~17 GB): ~80-110 tok/s decode, TTFT ~80-150 ms on 1K prompts. The headline daily-driver workload.
- Q5_K_M (~20 GB): ~65-90 tok/s, slightly better quality, less context room.
- Q8_0 (~30 GB partial-offload): ~25-40 tok/s. Quality bump over Q4 is small; rarely worth the speed loss.
- Compare with: Qwen 2.5 32B Instruct at Q4 on same hardware: ~35-50 tok/s. Q3 30B-A3B's MoE efficiency wins by 2-3× on raw speed.
Should you run this locally?
Yes, for anyone with a 24-GB single GPU who wants frontier-adjacent quality at consumer hardware tier. This is the Qwen-MoE-on-a-4090 daily driver. The right pick for general assistant work, RAG pipelines, agent loops, and most coding tasks. If you have the hardware, run this.
Yes, for Mac Studio M-series operators who want a fast Qwen variant for daily use without committing to the 235B-tier hardware footprint. Q4 fits any M-class with 32 GB+ unified memory.
No, for anyone running a sub-16-GB card. Q4 needs ~17 GB; partial-offload doesn't make sense for a model that's already MoE-efficient. Use Qwen 3 8B or Qwen 2.5 7B instead at smaller-card tiers.
Probably not, for anyone whose primary workload is coding (Qwen 2.5 Coder 32B at Q4 fits 24 GB and outperforms on coding-specific benchmarks at the cost of slower decode).
Probably not, for anyone whose primary need is multilingual (where Q3 30B-A3B beats most alternatives but the dense-32B Qwen variants beat it slightly).
How it compares
- vs Qwen 3.5 235B-A17B (frontier) → 235B has higher quality ceiling but needs 128-GB+ hardware. 30B-A3B fits a 24-GB consumer card. Pick 30B-A3B for accessibility; pick 235B-A17B if you have the hardware AND need the extra quality. Different operator tiers.
- vs Qwen 3 235B-A22B (prior-gen frontier) → same hardware contrast as 3.5. The 30B-A3B is the consumer-tier answer to either 235B variant.
- vs Qwen 2.5 32B Instruct (dense, same param count, same VRAM) → 32B-Instruct is dense (32B compute per token vs 3B for the MoE). 32B-Instruct edges quality on most benchmarks; 30B-A3B is ~2-3× faster. Pick MoE for daily-driver speed; dense for production reliability.
- vs Qwen 2.5 Coder 32B (coding specialist) → Coder 32B beats Q3 30B-A3B on coding tasks. Pick Coder 32B if you're using Aider or Continue for serious code work; pick Q3 30B-A3B for general assistant + light coding.
- vs DeepSeek R1 Distill family (reasoning specialists) → R1 Distill 32B specializes in reasoning chain-of-thought. Q3 30B-A3B is generalist. Pick R1 Distill for math + logic puzzles; pick 30B-A3B for daily mixed-use.
- vs Llama 4 Scout → Scout has 128k effective context vs 30B-A3B's 32k. Llama license has 700M MAU clause; Apache 2.0 wins for most teams. Pick Scout for long-context; pick 30B-A3B for license simplicity.
Run this yourself
# RTX 4090 / 3090 / 7900 XTX — single-card 24 GB
ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b
# For better runtime control via llama.cpp:
llama-server -m qwen3-30b-a3b-Q4_K_M.gguf \
--ctx-size 32768 -ngl 999 --temp 0.7
# For multi-user serving via vLLM (production-tier):
vllm serve Qwen/Qwen3-30B-A3B-Instruct \
--tensor-parallel-size 1 --max-model-len 32768
Quant: Q4_K_M GGUF
Context: 32768 (KV cache f16, ~2 GB additional)
Backend: llama.cpp via Ollama, CUDA 12.x
Hardware: RTX 4090, NVIDIA driver 555+
Overview
Mid-tier Qwen 3 MoE. 30B total / 3B active means 70B-class quality at 7B-class inference speed on a single 24GB card. The sweet spot of the Qwen 3 lineup for prosumer hardware.
Featured in this stack
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3·Production tier·Role: MoE workload (3B active, 30B total)Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink
30B/3B-active MoE on dual-4090 PCIe is the throughput sweet spot. Expert routing across cards is bandwidth-friendlier than tensor parallelism for dense models, so the no-NVLink penalty is smaller. ~80 tok/s decode at 8 concurrent.
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- 3B active params = fast inference
- Apache 2.0
- Thinking mode
Weaknesses
- Total weights still 18GB at Q4
- MoE routing varies in quality
Prompting kit
Tested patterns for getting the most out of Qwen 3 30B-A3B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Recommended system prompt
You are Qwen, a helpful assistant created by Alibaba Cloud. Answer the user's question directly and concisely. When the task requires step-by-step analysis, work through it carefully before giving the final answer.
Quirks to know
- •Mixture-of-experts: 30B total parameters, ~3B active per forward pass. Inference cost matches a 3B dense model; quality lands closer to a 14B dense model per Qwen's release notes.
- •Same /think and /no_think mode toggles as Qwen 3 32B dense. /think for reasoning, /no_think for short Q&A.
- •Native 32K context. Extendable to 128K with YaRN scaling — set rope_scaling factor to 4.0.
- •Uses ChatML format with <|im_start|> / <|im_end|> role tokens — confirm tokenizer_config.json template in your runtime.
- •MoE inference: requires more total VRAM (load all experts) than dense at same active count. Quantization works but Q4_K_M MoE quants are noisier than Q4_K_M dense — Qwen recommends Q5_K_M or Q6_K for MoE if VRAM allows.
Chat template
Same template as Qwen 3 32B — apply via runtime, not hand-rolled, because the /think toggle inserts an extra system marker.
Tool calling
Same Hermes-style format as Qwen 3 32B. Tools declared in system prompt, calls emitted as <tool_call>{...}</tool_call> blocks.
Sampler settings
- temperature
- 0.7
- top_p
- 0.8
- top_k
- 20
Vendor defaults. For /think mode, switch to temperature 0.6, top_p 0.95 per the model card.
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 18.0 GB | 22 GB |
| Q8_0 | 32.0 GB | 36 GB |
Get the model
Ollama
One-line install
ollama run qwen3:30bRead our Ollama review →HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Qwen 3 30B-A3B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Qwen 3 30B-A3B?
Can I use Qwen 3 30B-A3B commercially?
What's the context length of Qwen 3 30B-A3B?
How do I install Qwen 3 30B-A3B with Ollama?
Compare against other models
Curated head-to-head decisions where Qwen 3 30B-A3B is one of the contenders. For arbitrary pairings use /model-battle.
Source: huggingface.co/Qwen/Qwen3-30B-A3B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Qwen 3 30B-A3B runs on your specific hardware before committing money.