phi
3.8B parameters
Commercial OK

Phi-3.5 Mini Instruct

Compact 3.8B Phi for edge deployment. 128K context. Strong reasoning per parameter.

License: MIT · Released Aug 20, 2024 · Context: 131,072 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
7.2/10
Positioning

The right pick when VRAM is the gating constraint — sub-6 GB cards, integrated GPUs, edge devices, or as a fast secondary model for routing/classification in agent loops. Microsoft's heavily curated, textbook-style synthetic training data shows: it's startlingly capable for 3.8B parameters.
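
One way to use it as that fast router is to have Phi classify each incoming request and escalate only the hard ones to a larger model. A minimal sketch against a local Ollama server follows; the model tags, categories, and escalation target are illustrative assumptions, not a prescribed setup.

# Sketch: Phi-3.5 Mini as a cheap router in front of a larger model, via the local Ollama API.
# Requires the requests package and an Ollama server with both models pulled.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
ROUTER_MODEL = "phi3.5:3.8b-mini-instruct-q4_K_M"
HEAVY_MODEL = "llama3.1:8b"  # example escalation target; swap in whatever you actually run

ROUTER_PROMPT = (
    'Classify the request into exactly one of "chat", "code", or "hard_reasoning". '
    'Answer as JSON like {"category": "..."}.\n\nRequest: '
)

def generate(model, prompt, json_mode=False, **options):
    payload = {"model": model, "prompt": prompt, "stream": False, "options": options}
    if json_mode:
        payload["format"] = "json"  # constrain decoding to valid JSON
    resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]

def route(request):
    # Cheap classification pass on Phi, forced into JSON so the answer parses reliably.
    raw = generate(ROUTER_MODEL, ROUTER_PROMPT + request, json_mode=True, temperature=0.0)
    category = json.loads(raw).get("category", "chat")
    # Only the genuinely hard requests get escalated to the larger model.
    model = HEAVY_MODEL if category == "hard_reasoning" else ROUTER_MODEL
    return generate(model, request, temperature=0.7)

print(route("Prove that the sum of two odd integers is even."))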

Strengths
  • 2.3 GB at Q4_K_M — runs on essentially anything that exists, including 4 GB GPUs with comfortable context.
  • Structured output and math are genuinely good for the size class — better than Llama 3.2 3B on GSM8K and JSON-mode tasks.
  • MIT license: cleanest license in the curated-data model space.
Limitations
  • Open-domain knowledge is shallow — the textbook-only training shows on pop culture, recent events, and obscure technical lore.
  • Refusal behavior is aggressive — defaults to over-cautious answers on anything dual-use.
  • Long-context recall is weak despite the 128K spec — past ~16K, quality degrades sharply.
Real-world performance on RTX 4090
  • Q4_K_M (2.3 GB): 130–155 tok/s decode, TTFT under 50 ms
  • Q5_K_M (2.8 GB): 120–140 tok/s
  • Q8_0 (4.1 GB): 100–120 tok/s — surprisingly worth it; Q8 quality bump is larger than usual
Should you run this locally?

Yes, for edge deployment, fast routing/classification in agent stacks, math-heavy structured tasks, or any rig with under 6 GB VRAM. No, for open-ended chat, creative writing, or current-events tasks.
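
For the structured-output side of that, Ollama's local API can hard-constrain the reply to valid JSON via its format field; the prompt and field names below are only an illustration of the pattern.

# Sketch: JSON-constrained extraction with Phi-3.5 Mini through the local Ollama API.
# The prompt and schema are illustrative; "format": "json" is the Ollama option doing the work.
import json
import requests

payload = {
    "model": "phi3.5:3.8b-mini-instruct-q4_K_M",
    "prompt": ('Extract {"item": string, "quantity": integer, "unit_price": number} '
               "from: 'Ordered 3 USB-C cables at $7.99 each.' Respond with JSON only."),
    "format": "json",                # force syntactically valid JSON output
    "stream": False,
    "options": {"temperature": 0.0},
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=60)
resp.raise_for_status()
print(json.loads(resp.json()["response"]))  # e.g. {"item": "USB-C cable", "quantity": 3, "unit_price": 7.99}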

How it compares
  • vs Llama 3.2 3B → Phi wins on math + structured output; Llama wins on conversational naturalness and knowledge breadth. Pick Phi for tooling, Llama for chat.
  • vs Llama 3.1 8B → Llama 3.1 8B is materially more capable across the board but uses 2× VRAM. Phi is the right pick only when VRAM matters.
  • vs Gemma 3 4B → very close call; Gemma 3 4B has a slight edge on multilingual + general chat, Phi 3.5 Mini wins on math + JSON. Both excellent in the 4B class.
  • vs Phi-4 14B → not in the same class; Phi-4 is competitive with Llama 3.1 8B, Phi-3.5 Mini is a different efficiency tier.
Run this yourself
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
ollama run phi3.5:3.8b-mini-instruct-q4_K_M
Settings: Q4_K_M GGUF, 4096 ctx, llama.cpp/CUDA, RTX 4090
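
Those settings are not baked into the model; with the Ollama API you can pass the context size and sampling options per request. A minimal sketch using the same Q4_K_M tag and the 4,096-token context the numbers above were measured at:

# Sketch: calling the Q4_K_M tag with the 4096-token context used for the throughput figures.
# num_ctx controls how much context (and therefore KV cache) Ollama allocates for the request.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3.5:3.8b-mini-instruct-q4_K_M",
        "prompt": "List three trade-offs of 4-bit quantization.",
        "stream": False,
        "options": {"num_ctx": 4096, "temperature": 0.7, "num_predict": 256},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
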
Why this rating

7.2/10 — punches well above its parameter count, especially on math and structured output. It loses points to general-purpose models with twice the parameters on open-ended chat, but no other 4B-class model is in this league.

Overview

Compact 3.8B Phi for edge deployment. 128K context. Strong reasoning per parameter.

Strengths

  • MIT license
  • 128K context
  • Edge-class footprint

Weaknesses

  • Heavy refusals
  • Synthetic-data quirks

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          2.4 GB       4 GB
Q8_0            4.1 GB       5 GB

Get the model

Ollama

One-line install

ollama run phi3.5:3.8b
Read our Ollama review →

HuggingFace

Original weights

huggingface.co/microsoft/Phi-3.5-mini-instruct

Source repository — you will need to quantize the weights yourself (e.g., to GGUF) before running them locally.

Frequently asked

What's the minimum VRAM to run Phi-3.5 Mini Instruct?

4 GB of VRAM is enough to run Phi-3.5 Mini Instruct at the Q4_K_M quantization (2.4 GB file). Higher-quality quantizations need more.

Can I use Phi-3.5 Mini Instruct commercially?

Yes — Phi-3.5 Mini Instruct ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of Phi-3.5 Mini Instruct?

Phi-3.5 Mini Instruct supports a context window of 131,072 tokens (128K).
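
Note that Ollama allocates a much smaller context by default, so to use a long window you have to raise num_ctx yourself, and KV-cache VRAM grows with it. A rough sketch (the input file is a placeholder):

# Sketch: requesting a larger context window for a single call.
# Bigger num_ctx means a bigger KV cache, so VRAM use climbs well past the 4 GB floor.
import requests
from pathlib import Path

long_document = Path("meeting_notes.txt").read_text()  # placeholder input
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3.5:3.8b",
        "prompt": long_document + "\n\nQuestion: what were the action items?",
        "stream": False,
        "options": {"num_ctx": 32768},  # far below the 131,072 max; recall degrades past ~16K anyway
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])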

How do I install Phi-3.5 Mini Instruct with Ollama?

Run `ollama pull phi3.5:3.8b` to download, then `ollama run phi3.5:3.8b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/microsoft/Phi-3.5-mini-instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.