Qwen 3 14B
14B Qwen 3. Fits on 12GB cards at Q4. Strong default for users with a single mid-range GPU.
Positioning
The most capability per VRAM available. Qwen 3 14B at Q4_K_M fits in ~9 GB, leaving room for full 32K context on a 16 GB card, and in thinking mode it punches well above its parameter weight on math and code.
Strengths
- 9 GB at Q4_K_M — leaves headroom on RTX 3060 12 GB and RTX 4060 Ti 16 GB.
- Hybrid reasoning lifts hard-task scores by 10–15 points GSM8K-equivalent vs non-thinking.
- Long context recall holds up out to 32K in practice — better than Qwen 2.5 14B.
Limitations
- Thinking-mode latency is real — budget 2–3× tokens for hard prompts.
- Tool use still rougher than Llama — function-call loops occasionally derail.
- License caps unchanged from Qwen 2.5.
Real-world performance on RTX 4090
- Q4_K_M (9.1 GB): 60–75 tok/s decode (non-thinking); same speed thinking but 2–3× output
- Q5_K_M (10.5 GB): 50–62 tok/s
- Q8_0 (15.8 GB): 36–46 tok/s
Should you run this locally?
Yes, for RTX 3060 12 GB / 4060 Ti 16 GB / 4070 / 5070 owners who want the best capability for their hardware tier. New default for 12–16 GB cards. No, for users on 24 GB cards — jump to Qwen 3 32B or QwQ 32B; 14B is the wrong tier for that VRAM.
How it compares
- vs Qwen 2.5 14B → Qwen 3 14B with thinking mode is materially better; non-thinking is roughly even. Pick Qwen 3 going forward.
- vs Phi-4 14B → close call. Phi-4 has more polished reasoning; Qwen 3 14B has hybrid mode. Pick Phi-4 for steady reasoning, Qwen 3 14B for flexibility.
- vs Mistral Small 3 24B → Mistral Small is bigger, slightly stronger absolute capability; Qwen 3 14B is much more memory-efficient.
- vs Qwen 3 8B → 14B is meaningfully smarter; pick 14B if VRAM allows.
Run this yourself
ollama pull qwen3:14b
ollama run qwen3:14b
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090 / 4060 Ti 16 GB
›Why this rating
8.8/10 — the new 14B-class king. Qwen 3 14B in thinking mode hits performance bands previously reserved for 30B-class models, while staying inside 12 GB VRAM at Q4. The model 16 GB GPU owners should default to.
Overview
14B Qwen 3. Fits on 12GB cards at Q4. Strong default for users with a single mid-range GPU.
Featured in this stack
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3·Workstation tier·Role: Chat model (low-latency general-purpose)Build an RTX 4090 AI workstation stack (May 2026)
Qwen 3 14B at FP16 fits with massive headroom; serves chat, summaries, and tool-call workloads at 60+ tok/s with single-digit-ms TTFT on warm prefix. The right default when you don't need coding-class reasoning.
Execution notes
Operator notes
Qwen 3 14B is the consumer-tier reasoning model with thinking-mode toggle — the model that lets a single 16 GB GPU run extended chain-of-thought reasoning without dropping to 8B-tier quality.
What makes it the consumer-tier reasoning default:
- Hybrid thinking-mode toggle —
enable_thinking=truefor hard problems,falsefor fast chat. The toggle is per-call, not per-model. - Q4_K_M fits an 11 GB VRAM budget with 32K context — runs comfortably on RTX 4070 Ti / 4080 / 5070 Ti.
- Apache 2.0 license — no commercial friction.
- 128K context — long enough for most agent harness scenarios without context-window chunking.
Deployment notes
The /stacks/local-coding-agent canonical model is Qwen 2.5 Coder 32B, but if your reasoning workload exceeds what the Coder model handles (math, multi-step proof, ambiguous-spec design), Qwen 3 14B with thinking-mode is the consumer-tier upgrade.
Pair with:
- vLLM + AWQ-INT4 for production serving. The prefix-cache wins at agent-loop time are real even at consumer hardware.
- Ollama + Q4_K_M for solo-developer setups. Simpler ergonomics, 80% of the throughput.
- Continue or Aider for surgical-edit workflows.
For sub-16GB cards drop to Qwen 3 8B — same family, same thinking-mode toggle, fits 8 GB VRAM at Q4_K_M.
For workstation-tier upgrade go to Qwen 3 32B — the L1.25-enriched dense flagship in the family.
Runtime compatibility
- vLLM ✓ excellent. AWQ-INT4 supported; the production-default path.
- SGLang ✓ excellent. Particularly good when thinking-mode-on workloads have high prefix overlap.
- Ollama ✓ excellent. Q4_K_M GGUF is the solo-developer pick.
- llama.cpp ✓ excellent. Native GGUF support; Apple Silicon path via MLX-LM is also strong.
- MLX-LM ✓ excellent. M-series unified memory holds 32K context comfortably at MLX-4bit.
- TensorRT-LLM ✓ supported but rarely justified at this scale.
Quantization suitability
Q4_K_M GGUF is the consumer-tier sweet spot. AWQ-INT4 if you're on vLLM. Q5_K_M is feasible on 16 GB if you trim context. Q8 is overkill at 14B.
Avoid Q3-class quants for thinking-mode workloads — the chain-of-thought tokens are tightly tuned and the quality drop is disproportionate vs prior Qwen generations.
Best use cases
- Consumer-tier reasoning agents — thinking-mode toggle gives per-call control over chain-of-thought depth.
- Math problem solving at 16 GB-VRAM tier.
- Multi-step planning for agent harnesses where the Coder-tier reasoning isn't deep enough.
- General chat with reasoning fallback —
enable_thinking=falsefor fast chat,truewhen the question is hard.
When to use a different model
- Code-first workloads: Qwen 2.5 Coder 14B — coding-specialized at the same VRAM tier.
- Workstation-tier reasoning: Qwen 3 32B — the L1.25-enriched flagship, sharper reasoning.
- Reasoning without thinking-mode toggle: DeepSeek R1 Distill Qwen 14B — always-reasoning behavior; pick by paradigm preference.
- 8 GB-VRAM tier: Qwen 3 8B.
Failure modes specific to this model
- Thinking-mode token bloat.
enable_thinking=trueproduces 5-10× more tokens. Your concurrency model needs to budget for this — a single thinking-mode query can consume the throughput of 5-10 fast-chat queries. - Tool-call format inconsistency under thinking-mode. The chain-of-thought sometimes leaks into the tool-call JSON when temperature is high. Keep temperature ≤0.4 for tool-using agents.
- Reasoning-tag stripping. Some agent harnesses don't strip
<think>tags before display; the user sees raw chain-of-thought. Strip server-side or configure your harness.
Going deeper
- /stacks/local-coding-agent — agent-loop deployment recipe
- Qwen 3 32B — workstation-tier sibling (L1.25-enriched)
- vLLM operational review — production-recommended runtime
- Ollama operational review — solo-developer alternative
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Fits on RTX 3060/4060 Ti
- Apache 2.0
Weaknesses
- Some Chinese tokenizer quirks
Prompting kit
Tested patterns for getting the most out of Qwen 3 14B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Recommended system prompt
You are Qwen, a helpful assistant created by Alibaba Cloud. Answer directly and concisely. For multi-step problems, work through your reasoning before giving the final answer.
Quirks to know
- •Mid-tier Qwen 3 sibling. Fits comfortably in 16GB VRAM at Q4_K_M, in 24GB at Q6_K.
- •Same /think + /no_think toggle as the rest of the Qwen 3 family.
- •Native 32K context, extendable to 128K with YaRN.
- •ChatML chat template, same as the rest of the Qwen 3 family.
- •Per the model card, tool-call reliability sits between Qwen 3 8B and Qwen 3 32B — usable for production but expect occasional schema drift on long tool chains.
Chat template
Same template as Qwen 3 32B and Qwen 3 8B.
Tool calling
Hermes-style. Same convention as the rest of the Qwen 3 family.
Sampler settings
- temperature
- 0.7
- top_p
- 0.8
- top_k
- 20
Vendor defaults. /think mode: 0.6 / 0.95 per the model card.
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 8.4 GB | 11 GB |
| Q8_0 | 15.0 GB | 18 GB |
Get the model
Ollama
One-line install
ollama run qwen3:14bRead our Ollama review →HuggingFace
Original weights
Source repository — direct quantization required.
Benchmarks
Real measurements on real hardware. Numbers ship with the runner version, quant, and date.
| Hardware | Provenance | Quant | Ctx | Tokens / sec | TTFT | Date |
|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 16GB (Mobile) | EditorialM | Q4_K_M | 4K | 38.3tok/s | 310 ms | Jun 2, 26 |
What to do next
Got this model running on real hardware? Share what you measured — the form arrives with the model pre-selected.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Qwen 3 14B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Qwen 3 14B?
Can I use Qwen 3 14B commercially?
What's the context length of Qwen 3 14B?
How do I install Qwen 3 14B with Ollama?
Source: huggingface.co/Qwen/Qwen3-14B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Qwen 3 14B runs on your specific hardware before committing money.