qwen
30B parameters
Commercial OK
Reviewed June 2026

Qwen 3 30B-A3B

Mid-tier Qwen 3 MoE. 30B total / 3B active means 70B-class quality at 7B-class inference speed on a single 24GB card. The sweet spot of the Qwen 3 lineup for prosumer hardware.

License: Apache 2.0·Released Apr 29, 2025·Context: 131,072 tokens
BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
unrated

Positioning

Qwen 3 30B-A3B is the most operator-relevant Qwen MoE released so far. Where Qwen 3.5 235B-A17B and Qwen 3 235B-A22B need 128-GB+ unified memory or workstation-tier hardware, Qwen 3 30B-A3B fits comfortably on a single 24-GB consumer GPU at Q4 — meaning it runs on an RTX 3090 (used $700-1000), RTX 4090 ($1,400-1,900 used), RX 7900 XTX ($700-900), or any 24+ GB Mac. 30B total params with ~3B active per token: decode tok/s is closer to a 3B dense model than a 30B dense one, while quality lands meaningfully above 7B-class. The operator-grade pitch: this is the Qwen frontier you can actually run on the GPU you already own.

Strengths

  • Fits 24 GB single-card hardware at Q4 — ~16-19 GB at Q4, leaves comfortable headroom for 32K context. Single RTX 3090 / 4090 / 7900 XTX runs it natively without partial offload.
  • MoE efficiency is the architecture's killer feature. ~3B active params per token means decode speed approaches 7B-dense models, despite the 30B parameter count. Real-world: ~60-100 tok/s on consumer 24-GB cards.
  • Apache 2.0 license — same permissive Qwen-team license as the bigger 235B variants. Commercial use unrestricted, no MAU clauses.
  • Excellent multilingual + English combo. Same Qwen-strength on Chinese + 60+ languages. Outperforms most 7B-13B models on multilingual tasks while needing barely more memory.
  • Day-zero tooling support. vLLM, SGLang, llama.cpp, Ollama all shipped Q3 30B-A3B compatibility within hours of release. Less tooling lag than the 235B family.
  • Strong reasoning + coding combo for the active-parameter count. Doesn't beat 32B dense models like Qwen 2.5 Coder 32B on coding-specific benchmarks, but for general daily-driver work it's competitive at lower hardware cost.

Limitations

  • Quality ceiling is below the 235B siblings. Q3 30B-A3B is "frontier-adjacent" — comfortably better than 7B-class, meaningfully behind the 235B-class. The right framing: this is the best you can run on a 24-GB card, not the best Qwen has.
  • MoE expert routing isn't perfect. Some prompts activate suboptimal expert combinations, producing outputs that feel "off" compared to dense 32B models. Less of an issue with mature vLLM routing; more visible on llama.cpp implementations.
  • 3B active parameters means it can be more brittle on edge cases vs a 32B dense model. For production pipelines that need predictable behavior, dense 32B (Qwen 2.5 32B Instruct, Qwen 2.5 Coder 32B) might be the more reliable pick.
  • Effective context is ~32K despite the spec advertising more. Quality drops past 32K in our internal testing; not a 128K-effective model.

Real-world performance on RTX 4090 (24 GB)

  • Q4_K_M (~17 GB): ~80-110 tok/s decode, TTFT ~80-150 ms on 1K prompts. The headline daily-driver workload.
  • Q5_K_M (~20 GB): ~65-90 tok/s, slightly better quality, less context room.
  • Q8_0 (~30 GB partial-offload): ~25-40 tok/s. Quality bump over Q4 is small; rarely worth the speed loss.
  • Compare with: Qwen 2.5 32B Instruct at Q4 on same hardware: ~35-50 tok/s. Q3 30B-A3B's MoE efficiency wins by 2-3× on raw speed.

Should you run this locally?

Yes, for anyone with a 24-GB single GPU who wants frontier-adjacent quality at consumer hardware tier. This is the Qwen-MoE-on-a-4090 daily driver. The right pick for general assistant work, RAG pipelines, agent loops, and most coding tasks. If you have the hardware, run this.

Yes, for Mac Studio M-series operators who want a fast Qwen variant for daily use without committing to the 235B-tier hardware footprint. Q4 fits any M-class with 32 GB+ unified memory.

No, for anyone running a sub-16-GB card. Q4 needs ~17 GB; partial-offload doesn't make sense for a model that's already MoE-efficient. Use Qwen 3 8B or Qwen 2.5 7B instead at smaller-card tiers.

Probably not, for anyone whose primary workload is coding (Qwen 2.5 Coder 32B at Q4 fits 24 GB and outperforms on coding-specific benchmarks at the cost of slower decode).

Probably not, for anyone whose primary need is multilingual (where Q3 30B-A3B beats most alternatives but the dense-32B Qwen variants beat it slightly).

How it compares

  • vs Qwen 3.5 235B-A17B (frontier) → 235B has higher quality ceiling but needs 128-GB+ hardware. 30B-A3B fits a 24-GB consumer card. Pick 30B-A3B for accessibility; pick 235B-A17B if you have the hardware AND need the extra quality. Different operator tiers.
  • vs Qwen 3 235B-A22B (prior-gen frontier) → same hardware contrast as 3.5. The 30B-A3B is the consumer-tier answer to either 235B variant.
  • vs Qwen 2.5 32B Instruct (dense, same param count, same VRAM) → 32B-Instruct is dense (32B compute per token vs 3B for the MoE). 32B-Instruct edges quality on most benchmarks; 30B-A3B is ~2-3× faster. Pick MoE for daily-driver speed; dense for production reliability.
  • vs Qwen 2.5 Coder 32B (coding specialist) → Coder 32B beats Q3 30B-A3B on coding tasks. Pick Coder 32B if you're using Aider or Continue for serious code work; pick Q3 30B-A3B for general assistant + light coding.
  • vs DeepSeek R1 Distill family (reasoning specialists) → R1 Distill 32B specializes in reasoning chain-of-thought. Q3 30B-A3B is generalist. Pick R1 Distill for math + logic puzzles; pick 30B-A3B for daily mixed-use.
  • vs Llama 4 Scout → Scout has 128k effective context vs 30B-A3B's 32k. Llama license has 700M MAU clause; Apache 2.0 wins for most teams. Pick Scout for long-context; pick 30B-A3B for license simplicity.

Run this yourself

# RTX 4090 / 3090 / 7900 XTX — single-card 24 GB
ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b

# For better runtime control via llama.cpp:
llama-server -m qwen3-30b-a3b-Q4_K_M.gguf \
  --ctx-size 32768 -ngl 999 --temp 0.7

# For multi-user serving via vLLM (production-tier):
vllm serve Qwen/Qwen3-30B-A3B-Instruct \
  --tensor-parallel-size 1 --max-model-len 32768
Quant: Q4_K_M GGUF Context: 32768 (KV cache f16, ~2 GB additional) Backend: llama.cpp via Ollama, CUDA 12.x Hardware: RTX 4090, NVIDIA driver 555+

Overview

Mid-tier Qwen 3 MoE. 30B total / 3B active means 70B-class quality at 7B-class inference speed on a single 24GB card. The sweet spot of the Qwen 3 lineup for prosumer hardware.

Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
Qwen 3 235B-A22B235B
Frontier

Strengths

  • 3B active params = fast inference
  • Apache 2.0
  • Thinking mode

Weaknesses

  • Total weights still 18GB at Q4
  • MoE routing varies in quality

Prompting kit

From model card
source

Tested patterns for getting the most out of Qwen 3 30B-A3B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are Qwen, a helpful assistant created by Alibaba Cloud. Answer the user's question directly and concisely. When the task requires step-by-step analysis, work through it carefully before giving the final answer.

Quirks to know

  • Mixture-of-experts: 30B total parameters, ~3B active per forward pass. Inference cost matches a 3B dense model; quality lands closer to a 14B dense model per Qwen's release notes.
  • Same /think and /no_think mode toggles as Qwen 3 32B dense. /think for reasoning, /no_think for short Q&A.
  • Native 32K context. Extendable to 128K with YaRN scaling — set rope_scaling factor to 4.0.
  • Uses ChatML format with <|im_start|> / <|im_end|> role tokens — confirm tokenizer_config.json template in your runtime.
  • MoE inference: requires more total VRAM (load all experts) than dense at same active count. Quantization works but Q4_K_M MoE quants are noisier than Q4_K_M dense — Qwen recommends Q5_K_M or Q6_K for MoE if VRAM allows.

Chat template

ChatML (Qwen3 variant)

Same template as Qwen 3 32B — apply via runtime, not hand-rolled, because the /think toggle inserts an extra system marker.

Tool calling

✓ Supported(hermes-style)

Same Hermes-style format as Qwen 3 32B. Tools declared in system prompt, calls emitted as <tool_call>{...}</tool_call> blocks.

Sampler settings

temperature
0.7
top_p
0.8
top_k
20

Vendor defaults. For /think mode, switch to temperature 0.6, top_p 0.95 per the model card.

Browse prompting kits for every model →/prompting

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

QuantizationFile sizeVRAM required
Q4_K_M18.0 GB22 GB
Q8_032.0 GB36 GB

Get the model

Ollama

One-line install

ollama run qwen3:30bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/Qwen/Qwen3-30B-A3B

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Qwen 3 30B-A3B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Qwen 3 30B-A3B?

22GB of VRAM is enough to run Qwen 3 30B-A3B at the Q4_K_M quantization (file size 18.0 GB). Higher-quality quantizations need more.

Can I use Qwen 3 30B-A3B commercially?

Yes — Qwen 3 30B-A3B ships under the Apache 2.0, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 3 30B-A3B?

Qwen 3 30B-A3B supports a context window of 131,072 tokens (about 131K).

How do I install Qwen 3 30B-A3B with Ollama?

Run `ollama pull qwen3:30b` to download, then `ollama run qwen3:30b` to start a chat session. The default quantization is Q4_K_M.

Compare against other models

Curated head-to-head decisions where Qwen 3 30B-A3B is one of the contenders. For arbitrary pairings use /model-battle.

Source: huggingface.co/Qwen/Qwen3-30B-A3B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Before you buy

Verify Qwen 3 30B-A3B runs on your specific hardware before committing money.