Phi-4 14B
Microsoft's Phi-4 14B trained on synthetic textbook-quality data. Punches above weight on reasoning and math; MIT licensed.
Positioning
Phi-4 14B is the strongest entry in the Phi line and a legitimate alternative to Qwen 2.5 14B / Qwen 3 14B in the 12–16 GB VRAM bracket. It earns the score by being unusually strong on math and reasoning relative to its parameter count — the Phi philosophy paying off.
Strengths
- Math + structured reasoning lead the size class — beats Qwen 2.5 14B on GSM8K and MATH.
- MIT license — cleanest license in the 14B tier.
- Knowledge curation shows — fewer hallucinations on technical content.
Limitations
- Open-domain knowledge is shallower than Qwen / Llama at similar size — synthetic textbook training has tradeoffs.
- Refusal behavior is conservative — over-cautious on dual-use technical questions.
- Multilingual is weak — English-first training shows.
Real-world performance on RTX 4090
- Q4_K_M (8.4 GB): 70–85 tok/s decode, TTFT ~100 ms
- Q5_K_M (9.9 GB): 60–75 tok/s
- Q8_0 (14.7 GB): 42–52 tok/s
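As a rough way to read these figures, end-to-end response time is approximately the time to first token plus the output length divided by the decode rate. A minimal sketch, assuming the Q4_K_M numbers above and a hypothetical 500-token reply; the function name is illustrative, and only Q4_K_M has a TTFT listed.

```python
def estimate_response_seconds(output_tokens: int, decode_tok_s: float, ttft_ms: float = 0.0) -> float:
    """Approximate wall-clock time: time to first token plus steady-state decode."""
    if decode_tok_s <= 0:
        raise ValueError("decode_tok_s must be positive")
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s


# Q4_K_M on RTX 4090, per the list above: 70-85 tok/s decode, TTFT ~100 ms.
# The 500-token reply length is an assumption chosen for illustration.
slow = estimate_response_seconds(500, decode_tok_s=70, ttft_ms=100)
fast = estimate_response_seconds(500, decode_tok_s=85, ttft_ms=100)
print(f"500-token reply: {fast:.1f}-{slow:.1f} s")
```

Prompt processing time for long inputs is not modeled here, so treat the result as a lower bound for long-context requests.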
Should you run this locally?
Yes, for math and reasoning workloads, technical writing, and code review tasks; it is the strongest 14B for those jobs. No, for general open-domain chat, multilingual workloads, or anything requiring broad pop-culture / current-events knowledge.
How it compares
- vs Phi-3.5 Mini (3.8B) → Phi-4 is materially more capable across the board; different VRAM tier.
- vs Phi-4 Reasoning 14B → Reasoning variant pushes hard problems further with chain-of-thought; base Phi-4 is faster on simple prompts.
- vs Qwen 2.5 14B → Phi-4 wins on math/reasoning; Qwen wins on knowledge breadth and multilingual.
- vs Qwen 3 14B → coin flip on hard tasks. Qwen 3 has hybrid mode flexibility; Phi-4 has cleaner license.
Run this yourself
ollama pull phi4:14b-q4_K_M
ollama run phi4:14b-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4060 Ti 16 GB / 4090
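The same settings can be applied programmatically. A minimal sketch, assuming a local Ollama server on its default port and the `/api/chat` endpoint with the `num_ctx` option; the helper names `build_chat_payload` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local Ollama address


def build_chat_payload(prompt: str, model: str = "phi4:14b-q4_K_M", num_ctx: int = 16384) -> dict:
    """Build a non-streaming chat request body using the context size from the settings above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def chat(prompt: str) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["message"]["content"]


# Usage (requires the model to be pulled and the server to be running):
#   print(chat("What is 17 * 23? Show your work."))
```

Only the standard library is used, so the sketch runs without installing a client package.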
Why this rating
8.6/10 — Microsoft's curated-data approach scaled to 14B. Reasoning quality is genuinely impressive — competitive with much larger models — and the synthetic-textbook training shows on math and structured tasks. Loses points only because Qwen 3 14B's hybrid mode offers more flexibility.
Featured in this stack
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Build a 16GB VRAM local AI stack (May 2026) · Stack · L3 · Homelab tier · Role: Primary chat / lightweight coding model
Phi-4 14B over Qwen 2.5 14B for the 16GB tier: Phi-4 has stronger reasoning per parameter and fits Q4_K_M comfortably (~9.5GB) with KV-cache headroom for 8K context. Qwen 2.5 14B is the alternative when reasoning matters less than coding-specific quality.
Execution notes
Operator notes
Phi-4 14B is Microsoft's reasoning-per-parameter champion in the 14B class. The Phi family's traditional advantage — strong reasoning quality at small parameter counts via curated training data — carries into Phi-4. MIT-licensed; no commercial-use friction.
The right pick for the 16 GB VRAM tier when reasoning matters more than coding-specific quality. For coding, Qwen 2.5 Coder 14B wins; for general reasoning + chat, Phi-4 14B is the operator default.
Deployment notes
The /stacks/16gb-vram-local-ai canonical recipe pairs Phi-4 14B with Ollama on an RTX 4060 Ti 16 GB. Throughput is 25-35 tok/s, and power draw is ~135 W under load (about half that of a 4090). The configuration runs comfortably within the budget tier without requiring an upgrade to an RTX 4090.
For multimodal workflows, Phi-4 Multimodal is a sibling in the same family with vision support. It is a smaller model (about 5.6B parameters, not 14B) and belongs to a different stack: /stacks/local-vision-model.
For edge / phone tier, Phi-4 Mini 4B is the same family compressed to ~3.8B params.
Runtime compatibility
- Ollama ✓ excellent. Q4_K_M GGUF pulls in one command; canonical first-pull experience.
- vLLM ✓ good. AWQ available; less common than the GGUF path at this size class.
- MLX-LM ✓ good. Apple Silicon path; the 14B size sits comfortably in 24 GB unified memory.
- llama.cpp ✓ excellent. Native GGUF support; the engine under Ollama / LM Studio.
Quantization suitability
Q4_K_M is the production-recommended quant. Phi-4's training discipline shows up in quant survival: it loses less quality at lower quants than typical models. Q5_K_M provides marginal benefit at ~18% more memory (9.9 GB vs 8.4 GB for Q4_K_M); usually not worth it. Avoid Q3-class quants, where the reasoning depth that is Phi's edge degrades meaningfully.
When to use a different model
- Coding-first workloads: Qwen 2.5 Coder 14B — same VRAM envelope; better on coding benchmarks.
- 24 GB VRAM tier: graduate to Qwen 2.5 Coder 32B or DeepSeek R1 Distill Qwen 32B.
- 8 GB VRAM tier: drop to Phi-4 Mini 4B or Qwen 2.5 7B Instruct.
- Multimodal: Phi-4 Multimodal — same family, vision support.
Best use cases
- Consumer-tier reasoning + chat — 16 GB VRAM workstation deployments where coding isn't the primary workload.
- Single-user agent workflows — paired with Ollama on RTX 4060 Ti / 4070 Super; covers most non-autonomous coding-agent scenarios.
- Document summarization + analysis at the 16 GB tier.
- Educational deployment — MIT license + strong reasoning makes this the right pick for academic courseware.
Failure modes
- Tool-call format quirks. Phi-4 occasionally emits tool calls with slightly non-standard JSON; OpenHands / OpenClaw parsers handle most cases but verify with your specific harness.
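- Long-context KV pressure on 16 GB cards. Default Ollama 8K context is the right ceiling; the KV cache grows linearly with context length, so pushing to 32K (which also exceeds the 16K window on the model card) eats the remaining VRAM headroom.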
- Reasoning depth ceiling. The 14B parameter count caps how far reasoning goes; complex multi-step problems benefit from a 32B-class model even at the same architecture.
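The KV pressure point can be made concrete with a back-of-the-envelope estimate. A minimal sketch, assuming an fp16 cache and architecture values of 40 layers, 10 key/value heads, and a head dimension of 128; these numbers are assumptions to check against the model's config.json, and the function name is illustrative.

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 40, n_kv_heads: int = 10,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Estimate KV-cache size in GiB for a given context length.

    Defaults are assumed architecture values, not figures from this page.
    """
    # Factor of 2 covers one key tensor and one value tensor per layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30


for ctx in (8192, 16384, 32768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gib(ctx):.2f} GiB KV cache")
```

Under these assumptions the cache costs about 200 KiB per token, so each doubling of context doubles the cache on top of the quantized weights. Runtimes that quantize the KV cache would lower these figures.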
Going deeper
- /stacks/16gb-vram-local-ai — canonical 16 GB VRAM deployment recipe
- Ollama operational notes — runtime-specific detail
- /maps/inference-runtimes-2026 — runtime landscape
- Phi-4 Mini 4B, Phi-4 Multimodal — Phi-4 family siblings
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- MIT license
- Strong math and reasoning per param
- 16K context
Weaknesses
- Smaller context than Qwen/Llama
- Synthetic-data training shows in creative tasks
Prompting kit
Tested patterns for getting the most out of Phi-4 14B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Recommended system prompt
You are a careful, accurate assistant. Think step by step before answering. If a problem is mathematical or logical, work through it carefully and show your reasoning.
Quirks to know
- Phi-4 is 14B parameters but Microsoft's benchmarks show it matching Llama 3.3 70B on math and reasoning. The trade-off: world-knowledge breadth is narrower than its size would suggest, so don't lean on it for trivia; lean on it for structured reasoning.
- 16K context window per the model card. Shorter than current peers; if your task needs longer context, use Phi-4-mini or step up to a different family.
- Strict format adherence: Phi-4 tends to follow output format instructions more tightly than other models its size. Useful for JSON / structured output; sometimes too literal for casual chat.
- Heavy refusals on coding security topics (CVE details, exploit chains). Per the Phi-4 responsible-AI documentation, this is intentional. Phi-4 is not the right model for offensive-security work.
- Uses a ChatML-style chat template; most runtimes handle this automatically, but if you're hand-rolling, take the system/user/assistant tokens from the shipped template.
Chat template
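ChatML-style: <|im_start|>{role}<|im_sep|>{content}<|im_end|>. Per the model card's input format, Phi-4 separates role and content with an <|im_sep|> token rather than the newline of plain ChatML. The tokenizer_config.json ships the canonical template.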
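A minimal sketch of hand-rolling the prompt string in this shape. The function name `format_chat` and its `sep` parameter are illustrative, not part of any library. The role/content separator is left as an argument because the shipped template in tokenizer_config.json is the authority; the value used below (`<|im_sep|>`) is an assumption to verify against that file, and plain ChatML would pass a newline instead.

```python
def format_chat(messages: list, sep: str, add_generation_prompt: bool = True) -> str:
    """Render role/content messages into a ChatML-style prompt string."""
    parts = [f"<|im_start|>{m['role']}{sep}{m['content']}<|im_end|>" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"<|im_start|>assistant{sep}")
    return "".join(parts)


SEP = "<|im_sep|>"  # assumption: confirm against tokenizer_config.json

prompt = format_chat(
    [
        {"role": "system", "content": "You are a careful, accurate assistant."},
        {"role": "user", "content": "Is 91 prime?"},
    ],
    sep=SEP,
)
print(prompt)
```

When a runtime such as Ollama or llama.cpp applies the template for you, prefer that path; hand-rolled strings are mainly useful for raw completion endpoints.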
Tool calling
Base Phi-4 doesn't ship with native tool-calling tuning. Per the model card, function calling can be achieved through prompt convention but format reliability degrades vs models like Llama 3.3 or Mistral Small that were trained for it.
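Because format reliability is the weak point, a tolerant parser on the receiving side helps. A minimal sketch, assuming a hypothetical prompt convention in which the model is asked to reply with a JSON object of the form {"tool": ..., "arguments": {...}}; the convention and the function name are illustrative, not something Phi-4 or any harness defines.

```python
import json


def extract_tool_call(text: str):
    """Return the first JSON object in text that has 'tool' and dict 'arguments' keys, else None."""
    decoder = json.JSONDecoder()
    idx = text.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and ignores trailing text.
            obj, _ = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "tool" in obj and isinstance(obj.get("arguments"), dict):
            return obj
        idx = text.find("{", idx + 1)
    return None


reply = 'Sure, calling the tool.\n```json\n{"tool": "add", "arguments": {"a": 2, "b": 3}}\n```'
print(extract_tool_call(reply))
```

The parser skips surrounding prose and code fences but does not repair malformed JSON; a reply that fails to parse returns None, which the caller can treat as a signal to re-prompt.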
Sampler settings
- temperature: 0.7
- top_p: 0.95
Microsoft doesn't publish strict sampler defaults for Phi-4. These are the values used in the model's own technical-report evaluation runs.
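For readers unfamiliar with what these two knobs do, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. It illustrates the general technique, not the code any particular runtime uses; the function name is illustrative.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling, then nucleus (top-p) filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_top_p([1.0, 2.0, 3.0], rng=random.Random(0)))
```

Lower temperature sharpens the distribution before filtering, so fewer tokens survive the top-p cut; with the toy logits above, the least likely token is always excluded.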
Reviewed quality benchmarks
First-party rows were run by RunLocalAI; reviewed community rows are labeled in the data. Every row links to the raw test-run log.
| Benchmark | Quant | Runtime / Hardware | Score | Raw log |
|---|---|---|---|---|
| HumanEval+ (tested 2026-05-28) | Q4_K_M | ollama-0.24, rtx-3080-16gb-mobile | 78.7/100 | Gist → |
| MBPP+ (tested 2026-05-29) | Q4_K_M | ollama-0.24, rtx-3080-16gb-mobile | 60.3/100 | Gist → |

Q4_K_M note: First-party HumanEval+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.
Q4_K_M note: First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.
Want to verify? Every row links to its Gist with full stdout and stderr of the run. The runner script is in the public repo (scripts/run-humaneval-plus.ts) — reproducible end-to-end. Browse all coding scores at /benchmarks/coding.
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 8.4 GB | 11 GB |
| Q8_0 | 15.0 GB | 18 GB |
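
The table can be turned into a simple fit check. A minimal sketch using only the two rows above; it assumes that the quant with the larger VRAM requirement is the higher-fidelity one, and the names `QUANT_VRAM_GB` and `best_quant_for` are illustrative.

```python
# VRAM requirements in GB, copied from the table above.
QUANT_VRAM_GB = {"Q4_K_M": 11.0, "Q8_0": 18.0}


def best_quant_for(vram_gb: float):
    """Return the largest listed quant that fits in vram_gb, or None if none fit."""
    fitting = {name: need for name, need in QUANT_VRAM_GB.items() if need <= vram_gb}
    if not fitting:
        return None
    # Assumption: a larger VRAM requirement corresponds to a higher-fidelity quant.
    return max(fitting, key=fitting.get)


for card_vram in (8, 16, 24):
    print(f"{card_vram} GB card -> {best_quant_for(card_vram)}")
```

The requirement figures already include working headroom as listed; longer contexts than the default will need more than the table shows.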
Get the model
Ollama
One-line install
ollama run phi4:14b
HuggingFace
Original weights
Source repository — direct quantization required.
Source: huggingface.co/microsoft/phi-4
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.