qwen
14B parameters
Commercial OK
Reviewed June 2026

Qwen 3 14B

14B Qwen 3. Fits on 12GB cards at Q4. Strong default for users with a single mid-range GPU.

License: Apache 2.0·Released Apr 29, 2025·Context: 131,072 tokens
BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
8.8/10

Positioning

The most capability per VRAM available. Qwen 3 14B at Q4_K_M fits in ~9 GB, leaving room for full 32K context on a 16 GB card, and in thinking mode it punches well above its parameter weight on math and code.

Strengths

  • 9 GB at Q4_K_M — leaves headroom on RTX 3060 12 GB and RTX 4060 Ti 16 GB.
  • Hybrid reasoning lifts hard-task scores by 10–15 points GSM8K-equivalent vs non-thinking.
  • Long context recall holds up out to 32K in practice — better than Qwen 2.5 14B.

Limitations

  • Thinking-mode latency is real — budget 2–3× tokens for hard prompts.
  • Tool use still rougher than Llama — function-call loops occasionally derail.
  • License caps unchanged from Qwen 2.5.

Real-world performance on RTX 4090

  • Q4_K_M (9.1 GB): 60–75 tok/s decode (non-thinking); same speed thinking but 2–3× output
  • Q5_K_M (10.5 GB): 50–62 tok/s
  • Q8_0 (15.8 GB): 36–46 tok/s

Should you run this locally?

Yes, for RTX 3060 12 GB / 4060 Ti 16 GB / 4070 / 5070 owners who want the best capability for their hardware tier. New default for 12–16 GB cards. No, for users on 24 GB cards — jump to Qwen 3 32B or QwQ 32B; 14B is the wrong tier for that VRAM.

How it compares

  • vs Qwen 2.5 14B → Qwen 3 14B with thinking mode is materially better; non-thinking is roughly even. Pick Qwen 3 going forward.
  • vs Phi-4 14B → close call. Phi-4 has more polished reasoning; Qwen 3 14B has hybrid mode. Pick Phi-4 for steady reasoning, Qwen 3 14B for flexibility.
  • vs Mistral Small 3 24B → Mistral Small is bigger, slightly stronger absolute capability; Qwen 3 14B is much more memory-efficient.
  • vs Qwen 3 8B → 14B is meaningfully smarter; pick 14B if VRAM allows.

Run this yourself

ollama pull qwen3:14b
ollama run qwen3:14b
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090 / 4060 Ti 16 GB
Why this rating

8.8/10 — the new 14B-class king. Qwen 3 14B in thinking mode hits performance bands previously reserved for 30B-class models, while staying inside 12 GB VRAM at Q4. The model 16 GB GPU owners should default to.

Overview

14B Qwen 3. Fits on 12GB cards at Q4. Strong default for users with a single mid-range GPU.

Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Workstation tier·Role: Chat model (low-latency general-purpose)
    Build an RTX 4090 AI workstation stack (May 2026)

    Qwen 3 14B at FP16 fits with massive headroom; serves chat, summaries, and tool-call workloads at 60+ tok/s with single-digit-ms TTFT on warm prefix. The right default when you don't need coding-class reasoning.

Execution notes

L1.25 enriched

Operator notes

Qwen 3 14B is the consumer-tier reasoning model with thinking-mode toggle — the model that lets a single 16 GB GPU run extended chain-of-thought reasoning without dropping to 8B-tier quality.

What makes it the consumer-tier reasoning default:

  • Hybrid thinking-mode toggleenable_thinking=true for hard problems, false for fast chat. The toggle is per-call, not per-model.
  • Q4_K_M fits an 11 GB VRAM budget with 32K context — runs comfortably on RTX 4070 Ti / 4080 / 5070 Ti.
  • Apache 2.0 license — no commercial friction.
  • 128K context — long enough for most agent harness scenarios without context-window chunking.

Deployment notes

The /stacks/local-coding-agent canonical model is Qwen 2.5 Coder 32B, but if your reasoning workload exceeds what the Coder model handles (math, multi-step proof, ambiguous-spec design), Qwen 3 14B with thinking-mode is the consumer-tier upgrade.

Pair with:

  • vLLM + AWQ-INT4 for production serving. The prefix-cache wins at agent-loop time are real even at consumer hardware.
  • Ollama + Q4_K_M for solo-developer setups. Simpler ergonomics, 80% of the throughput.
  • Continue or Aider for surgical-edit workflows.

For sub-16GB cards drop to Qwen 3 8B — same family, same thinking-mode toggle, fits 8 GB VRAM at Q4_K_M.

For workstation-tier upgrade go to Qwen 3 32B — the L1.25-enriched dense flagship in the family.

Runtime compatibility

  • vLLM ✓ excellent. AWQ-INT4 supported; the production-default path.
  • SGLang ✓ excellent. Particularly good when thinking-mode-on workloads have high prefix overlap.
  • Ollama ✓ excellent. Q4_K_M GGUF is the solo-developer pick.
  • llama.cpp ✓ excellent. Native GGUF support; Apple Silicon path via MLX-LM is also strong.
  • MLX-LM ✓ excellent. M-series unified memory holds 32K context comfortably at MLX-4bit.
  • TensorRT-LLM ✓ supported but rarely justified at this scale.

Quantization suitability

Q4_K_M GGUF is the consumer-tier sweet spot. AWQ-INT4 if you're on vLLM. Q5_K_M is feasible on 16 GB if you trim context. Q8 is overkill at 14B.

Avoid Q3-class quants for thinking-mode workloads — the chain-of-thought tokens are tightly tuned and the quality drop is disproportionate vs prior Qwen generations.

Best use cases

  • Consumer-tier reasoning agents — thinking-mode toggle gives per-call control over chain-of-thought depth.
  • Math problem solving at 16 GB-VRAM tier.
  • Multi-step planning for agent harnesses where the Coder-tier reasoning isn't deep enough.
  • General chat with reasoning fallbackenable_thinking=false for fast chat, true when the question is hard.

When to use a different model

  • Code-first workloads: Qwen 2.5 Coder 14B — coding-specialized at the same VRAM tier.
  • Workstation-tier reasoning: Qwen 3 32B — the L1.25-enriched flagship, sharper reasoning.
  • Reasoning without thinking-mode toggle: DeepSeek R1 Distill Qwen 14B — always-reasoning behavior; pick by paradigm preference.
  • 8 GB-VRAM tier: Qwen 3 8B.

Failure modes specific to this model

  1. Thinking-mode token bloat. enable_thinking=true produces 5-10× more tokens. Your concurrency model needs to budget for this — a single thinking-mode query can consume the throughput of 5-10 fast-chat queries.
  2. Tool-call format inconsistency under thinking-mode. The chain-of-thought sometimes leaks into the tool-call JSON when temperature is high. Keep temperature ≤0.4 for tool-using agents.
  3. Reasoning-tag stripping. Some agent harnesses don't strip <think> tags before display; the user sees raw chain-of-thought. Strip server-side or configure your harness.

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
Qwen 3 32B32B
Workstation
Distilled / fine-tuned from this

Strengths

  • Fits on RTX 3060/4060 Ti
  • Apache 2.0

Weaknesses

  • Some Chinese tokenizer quirks

Prompting kit

From model card
source

Tested patterns for getting the most out of Qwen 3 14B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are Qwen, a helpful assistant created by Alibaba Cloud. Answer directly and concisely. For multi-step problems, work through your reasoning before giving the final answer.

Quirks to know

  • Mid-tier Qwen 3 sibling. Fits comfortably in 16GB VRAM at Q4_K_M, in 24GB at Q6_K.
  • Same /think + /no_think toggle as the rest of the Qwen 3 family.
  • Native 32K context, extendable to 128K with YaRN.
  • ChatML chat template, same as the rest of the Qwen 3 family.
  • Per the model card, tool-call reliability sits between Qwen 3 8B and Qwen 3 32B — usable for production but expect occasional schema drift on long tool chains.

Chat template

ChatML (Qwen3 variant)

Same template as Qwen 3 32B and Qwen 3 8B.

Tool calling

✓ Supported(hermes-style)

Hermes-style. Same convention as the rest of the Qwen 3 family.

Sampler settings

temperature
0.7
top_p
0.8
top_k
20

Vendor defaults. /think mode: 0.6 / 0.95 per the model card.

Browse prompting kits for every model →/prompting

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

QuantizationFile sizeVRAM required
Q4_K_M8.4 GB11 GB
Q8_015.0 GB18 GB

Get the model

Ollama

One-line install

ollama run qwen3:14bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/Qwen/Qwen3-14B

Source repository — direct quantization required.

Benchmarks

Real measurements on real hardware. Numbers ship with the runner version, quant, and date.

1 run on record
HardwareProvenanceQuantCtxTokens / secTTFTDate
NVIDIA GeForce RTX 3080 16GB (Mobile)
EditorialM
Q4_K_M4K
38.3tok/s
310 msJun 2, 26

What to do next

Got this model running on real hardware? Share what you measured — the form arrives with the model pre-selected.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Qwen 3 14B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Qwen 3 14B?

11GB of VRAM is enough to run Qwen 3 14B at the Q4_K_M quantization (file size 8.4 GB). Higher-quality quantizations need more.

Can I use Qwen 3 14B commercially?

Yes — Qwen 3 14B ships under the Apache 2.0, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 3 14B?

Qwen 3 14B supports a context window of 131,072 tokens (about 131K).

How do I install Qwen 3 14B with Ollama?

Run `ollama pull qwen3:14b` to download, then `ollama run qwen3:14b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/Qwen/Qwen3-14B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Before you buy

Verify Qwen 3 14B runs on your specific hardware before committing money.