phi
14B parameters
Commercial OK
Reviewed June 2026

Phi-4 14B

Microsoft's Phi-4 14B is trained on synthetic textbook-quality data. It punches above its weight on reasoning and math, and it is MIT licensed.

License: MIT · Released Dec 12, 2024 · Context: 16,384 tokens

Our verdict

OP · Fredoline Eruo | Verified Jun 12, 2026
8.6/10

Positioning

Phi-4 14B is the strongest entry in the Phi line and a legitimate alternative to Qwen 2.5 14B / Qwen 3 14B in the 12–16 GB VRAM bracket. It earns the score by being unusually strong on math and reasoning relative to its parameter count — the Phi philosophy paying off.

Strengths

  • Math + structured reasoning lead the size class — beats Qwen 2.5 14B on GSM8K and MATH.
  • MIT license — cleanest license in the 14B tier.
  • Knowledge curation shows — fewer hallucinations on technical content.

Limitations

  • Open-domain knowledge is shallower than Qwen / Llama at similar size — synthetic textbook training has tradeoffs.
  • Refusal behavior is conservative — over-cautious on dual-use technical questions.
  • Multilingual is weak — English-first training shows.

Real-world performance on RTX 4090

  • Q4_K_M (8.4 GB): 70–85 tok/s decode, TTFT ~100 ms
  • Q5_K_M (9.9 GB): 60–75 tok/s
  • Q8_0 (14.7 GB): 42–52 tok/s
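
To make the throughput figures above more concrete, here is a rough sketch that converts a decode rate and TTFT into wall-clock time for one reply. The function name and the 500-token reply length are illustrative assumptions; the rates are the Q4_K_M numbers from the list, and the model ignores prompt-length effects on TTFT.

```python
def estimate_latency_s(output_tokens: int, decode_tok_s: float, ttft_ms: float = 100.0) -> float:
    """Rough time for one response: time to first token plus steady-state decode."""
    if decode_tok_s <= 0:
        raise ValueError("decode_tok_s must be positive")
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s


# Q4_K_M on RTX 4090, per the list above: 70-85 tok/s decode, TTFT ~100 ms.
slow = estimate_latency_s(500, 70)  # lower end of the decode range
fast = estimate_latency_s(500, 85)  # upper end of the decode range
print(f"500-token reply: {fast:.1f}-{slow:.1f} s")
```

Under these assumptions a 500-token answer lands at roughly six to seven seconds on Q4_K_M.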

Should you run this locally?

Yes, for math and reasoning workloads, technical writing, and code review tasks; it is the strongest 14B for those jobs. No, for general open-domain chat, multilingual workloads, or anything requiring broad pop-culture or current-events knowledge.

How it compares

  • vs Phi-3.5 Mini (3.8B) → Phi-4 is materially more capable across the board; different VRAM tier.
  • vs Phi-4 Reasoning 14B → Reasoning variant pushes hard problems further with chain-of-thought; base Phi-4 is faster on simple prompts.
  • vs Qwen 2.5 14B → Phi-4 wins on math/reasoning; Qwen wins on knowledge breadth and multilingual.
  • vs Qwen 3 14B → coin flip on hard tasks. Qwen 3 has hybrid mode flexibility; Phi-4 has cleaner license.

Run this yourself

ollama pull phi4:14b-q4_K_M
ollama run phi4:14b-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4060 Ti 16 GB / 4090

Why this rating

8.6/10 — Microsoft's curated-data approach scaled to 14B. Reasoning quality is genuinely impressive — competitive with much larger models — and the synthetic-textbook training shows on math and structured tasks. Loses points only because Qwen 3 14B's hybrid mode offers more flexibility.


Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Homelab tier·Role: Primary chat / lightweight coding model
    Build a 16GB VRAM local AI stack (May 2026)

    Phi-4 14B over Qwen 2.5 14B for the 16GB tier: Phi-4 has stronger reasoning per parameter and fits Q4_K_M comfortably (~9.5GB) with KV-cache headroom for 8K context. Qwen 2.5 14B is the alternative when reasoning matters less than coding-specific quality.

Execution notes


Operator notes

Phi-4 14B is Microsoft's reasoning-per-parameter champion in the 14B class. The Phi family's traditional advantage — strong reasoning quality at small parameter counts via curated training data — carries into Phi-4. MIT-licensed; no commercial-use friction.

The right pick for the 16 GB VRAM tier when reasoning matters more than coding-specific quality. For coding, Qwen 2.5 Coder 14B wins; for general reasoning + chat, Phi-4 14B is the operator default.

Deployment notes

The /stacks/16gb-vram-local-ai canonical recipe pairs Phi-4 14B with Ollama on an RTX 4060 Ti 16 GB. Throughput is 25–35 tok/s; power draw is ~135 W under load (about half that of a 4090). The configuration runs comfortably within the budget tier without requiring an upgrade to an RTX 4090.

For multimodal workflows, Phi-4 Multimodal is the same family's vision-capable variant at 5.6B parameters; it belongs with the /stacks/local-vision-model recipe rather than this one.

For edge / phone tier, Phi-4 Mini 4B is the same family compressed to ~3.8B params.

Runtime compatibility

  • Ollama ✓ excellent. Q4_K_M GGUF pulls in one command; canonical first-pull experience.
  • vLLM ✓ good. AWQ available; less common than the GGUF path at this size class.
  • MLX-LM ✓ good. Apple Silicon path; the 14B size sits comfortably in 24 GB unified memory.
  • llama.cpp ✓ excellent. Native GGUF support; the engine under Ollama / LM Studio.

Quantization suitability

Q4_K_M is the production-recommended quant. Phi-4's training discipline shows up in quant survival: it loses less quality at lower quants than typical models. Q5_K_M provides marginal benefit at roughly 18% more memory (9.9 GB vs 8.4 GB), which is usually not worth it. Avoid Q3-class quants, where the reasoning depth that is Phi's edge degrades meaningfully.

When to use a different model

Best use cases

  • Consumer-tier reasoning + chat — 16 GB VRAM workstation deployments where coding isn't the primary workload.
  • Single-user agent workflows — paired with Ollama on RTX 4060 Ti / 4070 Super; covers most non-autonomous coding-agent scenarios.
  • Document summarization + analysis at the 16 GB tier.
  • Educational deployment — MIT license + strong reasoning makes this the right pick for academic courseware.

Failure modes

  1. Tool-call format quirks. Phi-4 occasionally emits tool calls with slightly non-standard JSON; OpenHands / OpenClaw parsers handle most cases but verify with your specific harness.
  2. Long-context KV pressure on 16 GB cards. An 8K context in Ollama is the right ceiling; pushing toward Phi-4's 16,384-token maximum consumes the VRAM headroom the KV cache needs.
  3. Reasoning depth ceiling. At 14B parameters there is a limit to reasoning depth; complex multi-step problems benefit from a 32B-class model, even one built on a similar architecture.
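
The KV-cache pressure in failure mode 2 can be estimated with a short calculation. This is a sketch under stated assumptions: the layer count (40), KV-head count (10), head dimension (128), and an unquantized fp16 cache are assumed values that should be checked against the model's config.json, and the function name is illustrative.

```python
def kv_cache_gib(
    context_tokens: int,
    n_layers: int = 40,        # assumed; verify in config.json
    n_kv_heads: int = 10,      # assumed; verify in config.json
    head_dim: int = 128,       # assumed; verify in config.json
    bytes_per_value: int = 2,  # fp16 cache; a quantized KV cache would be smaller
) -> float:
    """Approximate KV-cache size in GiB for one sequence."""
    # Keys and values each store n_kv_heads * head_dim values per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


for ctx in (8192, 16384):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.4f} GiB")
```

With these assumed values the cache roughly doubles from about 1.6 GiB at 8K to about 3.1 GiB at 16K, on top of the model weights and runtime overhead, which is why longer contexts squeeze a 16 GB card.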

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.


Strengths

  • MIT license
  • Strong math and reasoning per param
  • 16K context

Weaknesses

  • Smaller context than Qwen/Llama
  • Synthetic-data training shows in creative tasks

Prompting kit

From model card

Tested patterns for getting the most out of Phi-4 14B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are a careful, accurate assistant. Think step by step before answering. If a problem is mathematical or logical, work through it carefully and show your reasoning.

Quirks to know

  • Phi-4 is 14B parameters but Microsoft's benchmarks show it matching Llama 3.3 70B on math and reasoning. The trade-off: world-knowledge breadth is narrower than its size would suggest — don't lean on it for trivia, lean on it for structured reasoning.
  • 16K context window per the model card. Shorter than current peers; if your task needs longer context, use Phi-4-mini or step up to a different family.
  • Strict format adherence — Phi-4 tends to follow output format instructions more tightly than other models its size. Useful for JSON / structured output; sometimes too literal for casual chat.
  • Heavy refusals on coding security topics (CVE details, exploit chains). Per the Phi-4 responsible-AI documentation, this is intentional. Phi-4 is not the right model for offensive-security work.
  • Uses a ChatML-style chat template. Most runtimes handle this automatically, but if you're hand-rolling, note that per the model card the role is followed by an <|im_sep|> token rather than the newline used in standard ChatML.

Chat template

ChatML variant

ChatML-style turns of the form <|im_start|>{role}<|im_sep|>{content}<|im_end|>, per the model card. The tokenizer_config.json ships the canonical template.

Tool calling

✗ Not supported

Base Phi-4 doesn't ship with native tool-calling tuning. Per the model card, function calling can be achieved through prompt convention but format reliability degrades vs models like Llama 3.3 or Mistral Small that were trained for it.
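
Because function calling here relies on prompt convention, the harness has to tolerate imperfect output. Below is a minimal sketch of a lenient parser. It assumes a hypothetical convention in which the model is asked to reply with a JSON object containing "name" and "arguments" keys; that convention, the function name, and the sample reply are illustrative and not taken from the model card.

```python
import json
import re


def extract_tool_call(text: str):
    """Return the first JSON object with a "name" key found in the text, or None."""
    # Drop markdown code-fence markers the model may wrap around the JSON.
    cleaned = re.sub(r"```(?:json)?", "", text)
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", cleaned):
        candidate = cleaned[match.start():]
        try:
            obj, _ = decoder.raw_decode(candidate)
        except json.JSONDecodeError:
            # Retry once with trailing commas removed. This is a crude repair:
            # it would also alter a string value that contains ", }" or ", ]".
            repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
            try:
                obj, _ = decoder.raw_decode(repaired)
            except json.JSONDecodeError:
                continue
        if isinstance(obj, dict) and "name" in obj:
            return obj
    return None


reply = 'Sure.\n```json\n{"name": "get_weather", "arguments": {"city": "Oslo",}}\n```'
call = extract_tool_call(reply)
print(call)
```

A parser like this only recovers from surface-level formatting slips; validating the arguments against the tool's expected schema is still the caller's job.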

Sampler settings

temperature: 0.7
top_p: 0.95

Microsoft doesn't publish strict sampler defaults for Phi-4. These are the values used in the model's own technical-report evaluation runs.
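
As a sketch of how these sampler values and the recommended system prompt might be passed to a local Ollama server, the snippet below builds a request for Ollama's /api/chat endpoint. The host, the phi4:14b tag, the 8192-token num_ctx, and the helper names are assumptions for illustration; adjust them to your setup.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a careful, accurate assistant. Think step by step before answering. "
    "If a problem is mathematical or logical, work through it carefully and show your reasoning."
)


def build_chat_request(user_message: str, model: str = "phi4:14b", num_ctx: int = 8192) -> dict:
    """Build a non-streaming /api/chat payload with the sampler values listed above."""
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "options": {"temperature": 0.7, "top_p": 0.95, "num_ctx": num_ctx},
    }


def chat(user_message: str, host: str = "http://localhost:11434") -> str:
    """Send the request to a running Ollama server and return the reply text."""
    payload = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]


# Example (requires a running Ollama server with the model pulled):
#   print(chat("What is 17 * 23? Show your steps."))
```

Keeping the payload builder separate from the network call makes the settings easy to inspect or log before anything is sent.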


Reviewed quality benchmarks

First-party rows were run by RunLocalAI; reviewed community rows are labeled in the data. Every row links to the raw test-run log.

Benchmark                        Quant    Runtime / Hardware                   Score      Raw log
HumanEval+ (tested 2026-05-28)   Q4_K_M   ollama-0.24 / rtx-3080-16gb-mobile   78.7/100   Gist →
MBPP+ (tested 2026-05-29)        Q4_K_M   ollama-0.24 / rtx-3080-16gb-mobile   60.3/100   Gist →

Q4_K_M note: First-party HumanEval+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.

Q4_K_M note: First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.

Want to verify? Every row links to its Gist with full stdout and stderr of the run. The runner script is in the public repo (scripts/run-humaneval-plus.ts) — reproducible end-to-end. Browse all coding scores at /benchmarks/coding.
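
For readers checking a raw log, a score out of 100 like those above is a pass rate over the benchmark's tasks. The sketch below shows that arithmetic. The task totals (164 for HumanEval+, 378 for MBPP+) are assumptions about the benchmark versions used, and the pass counts are illustrative values chosen to land on the reported scores; they are not read from the Gists.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of tasks whose first sample passed all tests, rounded to 0.1."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    return round(100.0 * passed / total, 1)


# Illustrative counts only; take the real ones from the linked raw logs.
print(pass_rate(129, 164))  # assumed HumanEval+ task count: 164
print(pass_rate(228, 378))  # assumed MBPP+ task count: 378
```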

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         8.4 GB      11 GB
Q8_0           15.0 GB     18 GB
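
A small sketch of how the table can drive a choice: given a card's VRAM, pick the highest-quality quantization whose stated requirement fits. The figures come from the table above; the function and variable names are illustrative, and the check ignores extra memory for long contexts.

```python
# (name, file size in GB, VRAM required in GB), ordered from highest quality down.
QUANTS = [
    ("Q8_0", 15.0, 18.0),
    ("Q4_K_M", 8.4, 11.0),
]


def best_quant_for(vram_gb: float):
    """Return the highest-quality quant whose VRAM requirement fits, or None."""
    for name, _file_gb, vram_required_gb in QUANTS:
        if vram_gb >= vram_required_gb:
            return name
    return None


for card_vram in (8, 16, 24):
    print(card_vram, "GB ->", best_quant_for(card_vram))
```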

Get the model

Ollama

One-line install

ollama run phi4:14b

HuggingFace

Original weights

huggingface.co/microsoft/phi-4

Source repository with the original weights; you need to quantize them yourself to run a quantized build.

Frequently asked

What's the minimum VRAM to run Phi-4 14B?

11 GB of VRAM is enough to run Phi-4 14B at the Q4_K_M quantization (file size 8.4 GB). Higher-quality quantizations need more.

Can I use Phi-4 14B commercially?

Yes. Phi-4 14B ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of Phi-4 14B?

Phi-4 14B supports a context window of 16,384 tokens (about 16K).

How do I install Phi-4 14B with Ollama?

Run `ollama pull phi4:14b` to download, then `ollama run phi4:14b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/microsoft/phi-4

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
