Knowledge Distillation
Knowledge distillation is a technique in which a smaller, faster 'student' model is trained to mimic the behavior of a larger, more accurate 'teacher' model. The student learns from the teacher's output probabilities (soft labels) rather than only from the ground-truth labels, so it also picks up how the teacher ranks the alternatives it did not choose. In local AI, distillation produces models that fit in consumer VRAM (e.g., 8 GB) while retaining much of the teacher's capability. For example, a distilled 7B model may approach the performance of a 70B teacher on specific tasks while running at roughly 40 tok/s on an RTX 4090. The 70B teacher does not fit in that card's 24 GB even at 4-bit quantization, and drops to around 5 tok/s or less once layers are offloaded to system RAM.
Deeper dive
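Distillation training typically combines up to three loss terms: (1) a hard loss against the ground-truth labels, (2) a soft loss against the teacher's output distribution, with both teacher and student logits softened by a temperature parameter, and (3) optionally, a hidden-state alignment loss. The temperature controls how much the student learns from fine-grained class relationships: a higher temperature yields a softer probability distribution, which exposes the relative probabilities the teacher assigns to the classes it did not pick. Common variants include offline distillation (the teacher is fixed), online distillation (student and teacher are trained together), and self-distillation (teacher and student share an architecture, with an earlier checkpoint acting as teacher).
For operators, distillation is relevant because several popular local models are produced with it. Google reports training Gemma 2 2B and 9B with knowledge distillation, Meta reports using logits from Llama 3.1 8B and 70B when training Llama 3.2 1B and 3B, and the DeepSeek-R1-Distill models are smaller Qwen and Llama models fine-tuned on outputs generated by DeepSeek-R1. Microsoft's Phi-3-mini (3.8B) is a looser case: its technical report describes training on heavily filtered web data and synthetic LLM-generated data rather than logit-level distillation from a named teacher. Distilled models typically keep a standard architecture, and often the teacher's tokenizer, so they load into existing inference engines (llama.cpp, Ollama) without modification. The trade-off: distilled models may lack the teacher's breadth on rare topics, but they tend to do well on the kinds of data used during distillation.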
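The hard and soft loss terms described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training recipe: the logits, the temperature of 2.0, and the weighting factor alpha are made-up values, the function names are hypothetical, and the optional hidden-state alignment term is left out.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Hard loss: negative log-probability of the ground-truth class at T=1."""
    return -math.log(softmax(student_logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, assuming q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of the hard loss and the temperature-softened soft loss.

    temperature and alpha are illustrative defaults (assumptions).
    The T**2 factor follows Hinton et al. (2015) and keeps the soft-loss
    gradients on a comparable scale as the temperature changes.
    """
    hard = cross_entropy(student_logits, label)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft

# Made-up logits for a three-class problem.
teacher = [4.0, 2.5, 0.5]
student = [2.0, 1.0, 0.2]

print("teacher probs at T=1:", softmax(teacher, 1.0))
print("teacher probs at T=4:", softmax(teacher, 4.0))
print("combined loss:", distillation_loss(student, teacher, label=0))
```

Printing the teacher distribution at two temperatures shows the softening effect: at the higher temperature the top class takes a smaller share, so the lower-ranked classes contribute more to the soft loss.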
Practical example
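A concrete example: Microsoft's Phi-3-mini (3.8B), trained in part on synthetic data from larger models, fits in 4 GB of VRAM at Q4. On an RTX 3060 12 GB, it runs at roughly 50 tok/s. A 70B-class model at the same quantization needs roughly 40 to 48 GB of memory, so on the same card most of its layers spill to system RAM and throughput falls to around 2 tok/s. Microsoft has not named a specific teacher for Phi-3-mini, so the 70B-class comparison is illustrative. Phi-3-mini scores ~68% on MMLU versus roughly 80% for 70B-class models, but for many local tasks (chat, code generation) the difference is often small in practice.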
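The memory figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below covers the weights only; the value of 4.5 bits per weight is an assumed average for a 4-bit quantization format once its per-block scales are counted, and real files vary by format. The function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: ~4.5 bits/weight as an average for a 4-bit quantization format.
Q4_BITS = 4.5

for name, n_params in [("3.8B student", 3.8e9), ("70B-class model", 70e9)]:
    q4 = weight_memory_gb(n_params, Q4_BITS)
    fp16 = weight_memory_gb(n_params, 16)
    print(f"{name}: ~{q4:.1f} GB at Q4, ~{fp16:.1f} GB at FP16")
```

Under these assumptions the 3.8B model needs about 2.1 GB for weights and the 70B-class model about 39 GB. The KV cache and runtime buffers come on top, which is why practical VRAM requirements are somewhat higher than the weight size alone.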
Workflow example
When using Hugging Face Transformers, you can load a model like microsoft/Phi-3-mini-4k-instruct with AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"). In Ollama, ollama pull phi3:mini downloads the 3.8B model. The runtime treats it like any other model, so no special flags are needed. The benefit shows in VRAM usage: with ollama run phi3:mini the weights take about 2.5 GB of VRAM, which leaves headroom for a 4K context window even on a 4 GB card.
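To check throughput on your own hardware, one option is to call Ollama's local HTTP API and compute tokens per second from the timing fields in its response. The sketch below uses only the Python standard library and assumes Ollama is serving on its default address (localhost:11434) with phi3:mini already pulled. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="phi3:mini"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def tokens_per_second(result):
    """Decode speed from eval_count and eval_duration (reported in nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)

def generate(prompt, model="phi3:mini"):
    """Send the prompt and return the parsed JSON response as a dict."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, result = generate("Explain knowledge distillation in one sentence.") returns the generated text under result["response"], and tokens_per_second(result) gives the decode speed for that run. Results vary with prompt length, quantization, and whether the model was already loaded.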
Reviewed by Fredoline Eruo.