Phi-3.5 Mini Instruct

Positioning

The right pick when VRAM is the gating constraint: sub-6 GB cards, integrated GPUs, edge devices, or a fast secondary model for routing and classification in agent loops. Microsoft's emphasis on curated, textbook-style synthetic training data shows: the model is startlingly capable for 3.8B parameters.

Strengths

2.4 GB at Q4_K_M: fits on nearly any current GPU, including 4 GB cards with room left for a modest context window.
Structured output and math are genuinely good for the size class — better than Llama 3.2 3B on GSM8K and JSON-mode tasks.
MIT license: cleanest license in the curated-data model space.

Limitations

Open-domain knowledge is shallow: the textbook-style training data shows on pop culture, recent events, and obscure technical lore.
Refusal behavior is aggressive — defaults to over-cautious answers on anything dual-use.
Long-context recall is weak despite the 128K spec — past ~16K, quality degrades sharply.

Real-world performance on RTX 4090

Q4_K_M (2.4 GB): 130–155 tok/s decode, TTFT under 50 ms
Q5_K_M (2.8 GB): 120–140 tok/s
Q8_0 (4.1 GB): 100–120 tok/s — surprisingly worth it; Q8 quality bump is larger than usual
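
To check figures like these on your own hardware, one option is to read the timing fields that Ollama includes in a non-streaming `/api/generate` response. The sketch below assumes you already have that response parsed into a dict; `eval_count`, `eval_duration`, and `prompt_eval_duration` are the field names Ollama reports (durations in nanoseconds), while the helper names and the sample values are illustrative only. Prompt-evaluation time is used here as a rough proxy for TTFT, not an exact measurement.

```python
def decode_tokens_per_second(response: dict) -> float:
    """Decode throughput from a parsed Ollama generate response."""
    tokens = response["eval_count"]      # number of generated tokens
    nanos = response["eval_duration"]    # nanoseconds spent generating them
    if nanos <= 0:
        raise ValueError("eval_duration must be positive")
    return tokens / (nanos / 1e9)


def prompt_eval_ms(response: dict) -> float:
    """Prompt-processing time in milliseconds (rough proxy for TTFT)."""
    return response["prompt_eval_duration"] / 1e6


# Hypothetical response fragment; the values are made up for illustration.
sample = {
    "eval_count": 300,
    "eval_duration": 2_000_000_000,
    "prompt_eval_duration": 40_000_000,
}
```

Averaging over several prompts of similar length gives a steadier number than a single run, since the first request also pays the model load cost.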

Should you run this locally?

Yes, for edge deployment, fast routing/classification in agent stacks, math-heavy structured tasks, or any rig with under 6 GB VRAM. No, for open-ended chat, creative writing, or current-events tasks.
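
The routing use case can be sketched as follows: the small model labels each request, and only requests it cannot place go to a larger model. This is a minimal illustration under assumptions, not a prescribed design; the label set, the `ROUTES` table, and the `classify` callable (which would wrap a call to Phi-3.5 Mini) are all hypothetical, and the larger model tag is only an example.

```python
from typing import Callable

SMALL_MODEL = "phi3.5:3.8b-mini-instruct-q4_K_M"
LARGE_MODEL = "llama3.1:8b"  # illustrative choice of larger model

# Hypothetical label-to-model table: structured and math work stays small.
ROUTES = {
    "math": SMALL_MODEL,
    "extract": SMALL_MODEL,
    "chat": LARGE_MODEL,
}


def route(text: str, classify: Callable[[str], str]) -> str:
    """Return the model tag that should handle `text`.

    `classify` is any function that maps a request to a label string,
    for example a thin wrapper around a Phi-3.5 Mini classification prompt.
    Unknown labels fall back to the larger model.
    """
    label = classify(text).strip().lower()
    return ROUTES.get(label, LARGE_MODEL)
```

Falling back to the larger model on an unrecognized label keeps a misbehaving classifier from silently sending open-ended chat to the model that is weakest at it.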

How it compares

vs Llama 3.2 3B → Phi wins on math + structured output; Llama wins on conversational naturalness and knowledge breadth. Pick Phi for tooling, Llama for chat.
vs Llama 3.1 8B → Llama 3.1 8B is materially more capable across the board but uses 2× VRAM. Phi is the right pick only when VRAM matters.
vs Gemma 3 4B → very close call; Gemma 3 4B has a slight edge on multilingual + general chat, Phi-3.5 Mini wins on math + JSON. Both are excellent in the 4B class.
vs Phi-4 14B → not in the same class; Phi-4 is competitive with Llama 3.1 8B, Phi-3.5 Mini is a different efficiency tier.

Run this yourself

ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
ollama run phi3.5:3.8b-mini-instruct-q4_K_M
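
Once the model is pulled, it can also be called from code through Ollama's local HTTP API. The sketch below assumes a default Ollama install listening on `localhost:11434` and uses the `format: "json"` option to request JSON output, which suits the structured-output tasks this model is good at. The helper names and the choice of `temperature` 0 are illustrative assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL_TAG = "phi3.5:3.8b-mini-instruct-q4_K_M"      # tag from the pull command


def build_request(prompt: str, num_ctx: int = 4096) -> dict:
    """Build a non-streaming generate request that asks for JSON output."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,    # return one complete response object
        "format": "json",   # constrain the reply to valid JSON
        "options": {"num_ctx": num_ctx, "temperature": 0},
    }


def generate_json(prompt: str) -> dict:
    """Send the prompt to a running Ollama server and parse the JSON reply."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    # The generated text arrives as a string in the "response" field.
    return json.loads(reply["response"])
```

A call such as `generate_json("Return a JSON object with keys 'city' and 'country' for Paris.")` needs the Ollama server running; the prompt itself should still mention JSON, since the format option alone does not tell the model which keys to produce.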

Settings: Q4_K_M GGUF, 4096 ctx, llama.cpp/CUDA, RTX 4090

Quantization	File size	VRAM required
Q4_K_M	2.4 GB	4 GB
Q8_0	4.1 GB	5 GB
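
The gap between file size and VRAM required is mostly KV cache, which grows linearly with context length. The following is a rough estimator, not the method used to produce the table. It assumes an unquantized FP16 KV cache and the architecture values in the model's published `config.json` as I understand them (32 layers, 32 KV heads, head dimension 96); verify those against the config before relying on the numbers. Runtime buffers are left as an explicit `overhead_gib` parameter because they vary by backend.

```python
GIB = 2**30


def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 32,      # assumed from the model config
    n_kv_heads: int = 32,    # assumed: no grouped-query attention
    head_dim: int = 96,      # assumed: hidden size 3072 / 32 heads
    bytes_per_value: int = 2,  # FP16 cache
) -> int:
    """Bytes needed to cache keys and values for n_ctx tokens."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


def estimate_vram_gib(
    file_size_gb: float, n_ctx: int, overhead_gib: float = 0.0
) -> float:
    """Weights plus KV cache plus caller-supplied runtime overhead, in GiB."""
    weights_gib = file_size_gb * 1e9 / GIB
    return weights_gib + kv_cache_bytes(n_ctx) / GIB + overhead_gib
```

Under these assumptions the cache costs 384 KiB per token, so a 4096-token context adds 1.5 GiB and `estimate_vram_gib(2.4, 4096)` comes to roughly 3.7 GiB before runtime overhead. Doubling the context doubles the cache term, which is why long contexts quickly outgrow small cards.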

Frequently asked

What's the minimum VRAM to run Phi-3.5 Mini Instruct?

4 GB of VRAM is enough to run Phi-3.5 Mini Instruct at the Q4_K_M quantization (file size 2.4 GB). Higher-quality quantizations need more.

Can I use Phi-3.5 Mini Instruct commercially?

Yes. Phi-3.5 Mini Instruct ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of Phi-3.5 Mini Instruct?

Phi-3.5 Mini Instruct supports a context window of 131,072 tokens (128K).

How do I install Phi-3.5 Mini Instruct with Ollama?

Run `ollama pull phi3.5:3.8b` to download, then `ollama run phi3.5:3.8b` to start a chat session. The default quantization is Q4_K_M.
