RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Notable models & companies

Phi (Microsoft)

Phi is a family of small language models (SLMs) developed by Microsoft, designed to run efficiently on consumer hardware such as laptops, phones, and mid-range GPUs. The dense Phi models (Phi-1, Phi-1.5, Phi-2, Phi-3, Phi-3.5) range from 1.3B to 14B parameters and are trained on synthetic data and curated code and text to achieve strong reasoning per parameter. They are often used in place of larger models when VRAM or compute is limited, and are available in quantized formats (GGUF, AWQ) for local inference.

Deeper dive

Microsoft's Phi series targets the gap between tiny models (around 0.5B parameters) and large models (70B+). Phi-1 (1.3B) was trained on textbook-quality code data; Phi-1.5 (1.3B) added general text; Phi-2 scaled the approach to 2.7B; Phi-3 (3.8B, 7B, 14B) and Phi-3.5 (3.8B mini, plus vision and mixture-of-experts variants) use a mix of synthetic and heavily filtered web data. The key idea is training on high-quality synthetic data generated by larger models, which improves reasoning without scaling up the parameter count. Operators encounter Phi when a model must fit in limited VRAM: for example, an RTX 3060 12 GB can run Phi-3 14B Q4_K_M at ~30 tok/s. Phi-3 models support context lengths from 4K to 128K tokens depending on the variant, and are compatible with llama.cpp, Ollama, and MLX.
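The link between parameter count and on-disk size can be sketched with simple arithmetic. The snippet below is a rough estimate only: the figure of about 4.85 bits per weight for Q4_K_M is an assumption (the effective rate varies by model and tensor mix), and the function name is ours, not part of any tool.

```python
# Rough estimate of quantized weight size from parameter count.
# ASSUMPTION: Q4_K_M averages about 4.85 bits per weight; the real
# figure varies with the model architecture and tensor mix.
ASSUMED_Q4_K_M_BPW = 4.85


def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes), excluding KV cache."""
    # params (in billions) * bits per weight / 8 bits per byte = GB
    return params_billions * bits_per_weight / 8.0


for name, params in [("Phi-3 3.8B", 3.8), ("Phi-3 14B", 14.0)]:
    size = quantized_size_gb(params, ASSUMED_Q4_K_M_BPW)
    print(f"{name} at Q4_K_M: about {size:.1f} GB of weights")
```

The estimate covers weights only. Context (KV cache) and runtime buffers add to it, which is why a model whose weights nearly fill VRAM may still fail to load.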

Practical example

An operator with an RTX 3060 12 GB wants to run a local coding assistant. Phi-3 14B Q4_K_M (about 8 GB of VRAM) fits with room for a 4K context and delivers ~30 tok/s. The same card cannot hold Llama 3.1 70B Q4_K_M (about 40 GB) without offloading most layers to system RAM, which drops speed to ~5 tok/s. Phi-3 3.8B Q4_K_M (about 2.5 GB) fits entirely in the unified memory of an 8 GB Apple M1, running at ~20 tok/s.
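The three cases above reduce to one comparison: model size plus working overhead against available memory. A minimal sketch follows, using the sizes quoted in this entry. The 1.5 GB overhead for KV cache and runtime buffers is an assumed placeholder, and the helper names are hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: flat allowance for KV cache and runtime buffers.
# Real overhead grows with context length and differs per runtime.
ASSUMED_OVERHEAD_GB = 1.5


@dataclass
class Scenario:
    model: str
    model_gb: float   # quantized weight size quoted in this entry
    device: str
    memory_gb: float  # VRAM or unified memory on the device


def fits_in_memory(model_gb: float, memory_gb: float,
                   overhead_gb: float = ASSUMED_OVERHEAD_GB) -> bool:
    """True if weights plus the assumed overhead fit in device memory."""
    return model_gb + overhead_gb <= memory_gb


SCENARIOS = [
    Scenario("Phi-3 14B Q4_K_M", 8.0, "RTX 3060 12 GB", 12.0),
    Scenario("Llama 3.1 70B Q4_K_M", 40.0, "RTX 3060 12 GB", 12.0),
    Scenario("Phi-3 3.8B Q4_K_M", 2.5, "Apple M1 8 GB", 8.0),
]

for s in SCENARIOS:
    if fits_in_memory(s.model_gb, s.memory_gb):
        verdict = "fits"
    else:
        verdict = "needs offload to system RAM"
    print(f"{s.model} on {s.device}: {verdict}")
```

On unified-memory machines the operating system and other apps share the same pool, so the usable figure is lower than the headline capacity; treat the check as a first pass, not a guarantee.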

Workflow example

In Ollama, an operator runs ollama pull phi3:14b to download the 14B model (~8 GB). The runtime loads it into VRAM; if VRAM is insufficient, Ollama keeps some layers in system RAM and runs them on the CPU, reducing speed. In llama.cpp, the command ./main -m phi-3-mini-4k-instruct-q4_K_M.gguf -p "Write a Python script" (the binary is named llama-cli in newer builds) runs entirely on CPU if no GPU offload is configured, achieving ~10 tok/s on a laptop with 16 GB of RAM. In LM Studio, the operator selects Phi-3 from the model hub and monitors VRAM usage in the status bar.
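To check the tok/s figures on your own hardware, one option is to query a locally running Ollama server over its REST API and compute throughput from the timing fields in the response. The sketch below assumes Ollama is listening on its default port (11434) and that phi3:14b has already been pulled; the function names are ours. The network call is defined but not invoked, so the block runs without a server.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is running locally on its default port.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Request body for a single non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's timing fields.

    eval_count is the number of generated tokens and eval_duration
    is the generation time in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def generate(model: str, prompt: str, timeout_s: float = 300.0) -> dict:
    """Send one prompt to the local Ollama server and return the JSON reply."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as reply:
        return json.loads(reply.read().decode("utf-8"))


# Usage, with the server running:
#   result = generate("phi3:14b", "Write a Python script")
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

A noticeably lower number than expected usually means some layers were placed in system RAM rather than VRAM.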

Reviewed by Fredoline Eruo. See our editorial policy.
