Phi (Microsoft)
Phi is a family of small language models (SLMs) developed by Microsoft, designed to run efficiently on consumer hardware like laptops, phones, and mid-range GPUs. Phi models (Phi-1, Phi-1.5, Phi-2, Phi-3, Phi-3.5) range from 1.3B to 14B parameters and are trained on synthetic data and curated code/text to achieve strong reasoning per parameter. They are often used as drop-in replacements for larger models when VRAM or compute is limited, and are available in quantized formats (GGUF, AWQ) for local inference.
Deeper dive
Microsoft's Phi series targets the gap between tiny models (around 0.5B) and large models (70B+). Phi-1 (1.3B) was trained on textbook-quality code data; Phi-2 (2.7B) added general text; Phi-3 (3.8B, 7B, 14B) and Phi-3.5 (a 3.8B mini model plus mixture-of-experts and vision variants) use a mix of synthetic and filtered web data. The key idea is training on high-quality synthetic data generated by larger models, which improves reasoning without scaling up the parameter count. Operators encounter Phi where a capable model must fit in limited VRAM: an RTX 3060 12 GB, for example, can run Phi-3 14B Q4_K_M at ~30 tok/s. Phi-3 and Phi-3.5 models support context lengths from 4K to 128K tokens, depending on the variant, and are compatible with llama.cpp, Ollama, and MLX.
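The VRAM figures quoted for quantized Phi models can be sanity-checked with simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below assumes an average of about 4.85 bits per weight for Q4_K_M, which is an approximation rather than a published constant; the function name and the constant are illustrative, and real GGUF files differ somewhat by architecture.

```python
# Rough size estimate for quantized model weights.
# ASSUMPTION: ~4.85 bits per weight is an approximate average for Q4_K_M,
# not an exact figure; actual GGUF file sizes vary by architecture.
Q4_K_M_BITS_PER_WEIGHT = 4.85


def estimate_weights_gb(params_billion: float,
                        bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT) -> float:
    """Approximate weight size in GB (10^9 bytes) for a quantized model."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for name, params in [("Phi-3 3.8B", 3.8), ("Phi-3 14B", 14.0), ("Llama 3.1 70B", 70.0)]:
    print(f"{name}: ~{estimate_weights_gb(params):.1f} GB at Q4_K_M")
```

The estimate covers weights only. The KV cache and runtime buffers need additional memory, which is why a model whose weights nearly fill the card may still fail to load with a long context.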
Practical example
An operator with an RTX 3060 12 GB wants to run a local coding assistant. Phi-3 14B Q4_K_M (about 8 GB of VRAM) fits with room for a 4K context and delivers ~30 tok/s. The same card cannot hold Llama 3.1 70B Q4_K_M (about 40 GB); most of its layers would have to be offloaded to system RAM, which drops speed to roughly 5 tok/s. Phi-3 3.8B Q4_K_M (about 2.5 GB) fits entirely in the 8 GB unified memory of an Apple M1 and runs at ~20 tok/s.
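The fit-or-offload decision in this example can be approximated in a few lines. This is a minimal sketch, assuming all layers are the same size and reserving a hypothetical 1.5 GB of VRAM for the KV cache and runtime buffers; the function name, the reserve value, and the layer counts passed in (40 for Phi-3 14B, 80 for Llama 3.1 70B) are assumptions for illustration, not values taken from any runtime.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-size layers.

    reserve_gb is a hypothetical allowance for KV cache and runtime buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Phi-3 14B Q4_K_M (~8 GB, assumed 40 layers) on a 12 GB card
phi3_layers = gpu_layers_that_fit(model_gb=8.0, n_layers=40, vram_gb=12.0)

# Llama 3.1 70B Q4_K_M (~40 GB, assumed 80 layers) on the same card
llama_layers = gpu_layers_that_fit(model_gb=40.0, n_layers=80, vram_gb=12.0)

print(f"Phi-3 14B: {phi3_layers}/40 layers on GPU")
print(f"Llama 3.1 70B: {llama_layers}/80 layers on GPU")
```

Under these assumptions the Phi-3 model fits completely, while only about a quarter of the 70B model's layers stay on the GPU and the rest run from system RAM, which is where the slowdown comes from.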
Workflow example
In Ollama, an operator runs ollama pull phi3:14b to download the 14B model (~8 GB), then ollama run phi3:14b to start it. The runtime loads the model into VRAM; if VRAM is insufficient, Ollama offloads layers to system RAM, which reduces speed. In llama.cpp, the command ./main -m phi-3-mini-4k-instruct-q4_K_M.gguf -p "Write a Python script" (the binary is named llama-cli in newer builds) runs entirely on CPU unless GPU offload is configured with -ngl, achieving ~10 tok/s on a laptop with 16 GB of RAM. In LM Studio, the operator selects Phi-3 from the model hub and monitors VRAM usage in the status bar.
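Once the model is pulled, the same workflow can be scripted against Ollama's local HTTP API. The sketch below assumes a default Ollama install listening on localhost:11434 and uses only the Python standard library; the helper names are illustrative. It sends a non-streaming request to /api/generate and derives tokens per second from the eval_count and eval_duration fields of the response.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response: dict) -> float:
    """Compute generation speed; eval_duration is reported in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send the prompt and return the parsed JSON response."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())


# Usage (requires a running Ollama server with the model already pulled):
# result = generate("phi3:14b", "Write a Python script that reverses a string")
# print(result["response"])
# print(f"{tokens_per_second(result):.1f} tok/s")
```

Measuring tok/s this way gives a quick check on whether the model is running fully in VRAM or has been partly offloaded to system RAM.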
Reviewed by Fredoline Eruo. See our editorial policy.