Distillation

Distillation is a training technique in which a smaller 'student' model learns to mimic the behavior of a larger 'teacher' model. The student is trained on the teacher's output probabilities (soft labels) rather than only on hard ground-truth labels, so it picks up the teacher's generalization patterns. In local AI, distillation matters because it produces smaller models that run faster and use less VRAM while retaining much of the teacher's accuracy. For example, a distilled 7B model can approach the performance of a 13B teacher on some tasks, which makes it feasible to run on consumer GPUs.

Deeper dive

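Distillation, introduced by Hinton et al. (2015), trains the student on a combination of the teacher's softened probability distribution (controlled by a temperature parameter) and the original hard labels. A higher temperature flattens the teacher's distribution, so the student sees the relative probabilities the teacher assigns to incorrect classes, that is, the fine-grained similarities between classes. Variants include black-box distillation (only the teacher's outputs are available) and white-box distillation (the teacher's internals, such as intermediate layers, are also available). In practice, distillation is often combined with quantization to shrink the model further. For operators, small models offer a practical trade-off: they run on lower-end hardware (e.g., 4 GB VRAM) while delivering usable quality. DistilBERT is the classic distilled encoder and suits tasks such as classification and embedding. Small generative models such as TinyLlama or Phi-3-mini are often grouped with distilled models and handle tasks like summarization or code generation, although TinyLlama was pretrained from scratch and Phi-3-mini relies mainly on heavily filtered and synthetic training data rather than classic teacher-student distillation.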
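
The combination of a temperature-softened teacher distribution and hard labels can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the exact recipe of any particular model: the soft term is written as a KL divergence (a common formulation), the weighting factor alpha and the helper names softmax and distillation_loss are hypothetical, and the T squared scaling follows the suggestion in Hinton et al. (2015) for keeping the two terms on a comparable scale.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (label) term.

    alpha and the default temperature are illustrative assumptions.
    """
    # Soft term: KL(teacher || student), both softened by the temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Hard term: ordinary cross-entropy against the true label at T = 1.
    ce = -math.log(softmax(student_logits)[hard_label])

    # T^2 compensates for the smaller gradients of the softened term.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [4.0, 1.0, 0.2]   # made-up logits for a 3-class example
student = [2.5, 1.5, 0.5]
print(distillation_loss(student, teacher, hard_label=0))
```

Raising the temperature makes the teacher's distribution flatter, so the smaller logits contribute more to the soft term. When the student's logits match the teacher's exactly, the soft term is zero and only the hard-label term remains.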

Practical example

A 7B model distilled from Llama 3.1 70B might fit in about 6 GB of VRAM at Q4, whereas the teacher needs roughly 40 GB at the same quantization. On an RTX 3060 (12 GB), the distilled model might run at ~30 tok/s, while the teacher cannot load fully into VRAM and has to offload layers to system RAM, which can drop throughput to ~2 tok/s.
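
The memory figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only. The 4.5 bits per weight used for Q4 and the 1.5 GB allowance for the KV cache and runtime buffers are assumptions chosen for illustration, and the function names are hypothetical; real usage depends on the quantization format, context length, and runtime.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions, bits_per_weight, vram_gb, overhead_gb=1.5):
    """Rough check: weights plus an assumed fixed overhead must fit in VRAM."""
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


Q4_BITS = 4.5  # assumed effective bits per weight for a 4-bit quantization

for name, size in [("7B student", 7), ("70B teacher", 70)]:
    gb = weight_memory_gb(size, Q4_BITS)
    ok = fits_in_vram(size, Q4_BITS, vram_gb=12)
    print(f"{name}: ~{gb:.1f} GB of weights, fits in 12 GB: {ok}")
```

Under these assumptions the 7B student needs about 4 GB for weights and fits on a 12 GB card with room for context, while the 70B teacher needs about 39 GB and does not.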

Workflow example

With Hugging Face Transformers, you can load a distilled model such as distilbert-base-uncased with AutoModel.from_pretrained('distilbert-base-uncased'). In Ollama, you can pull phi3:mini (a small model often grouped with distilled models) with ollama pull phi3:mini and run it locally, where it uses less VRAM than a full-size model. The trade-off is somewhat lower accuracy on complex reasoning tasks.
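
The Transformers call mentioned above can be wrapped into a small helper. This is a sketch that assumes the transformers and torch packages are installed; the function name encode_text is hypothetical, and the first call downloads the model weights from the Hugging Face Hub.

```python
def encode_text(text, model_name="distilbert-base-uncased"):
    """Return per-token hidden states from a distilled encoder."""
    # Imported lazily so defining the helper does not require the libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)

    # Shape: (batch, number of tokens, hidden size)
    return outputs.last_hidden_state
```

Calling encode_text("Distilled models trade a little accuracy for speed.") returns a tensor of token embeddings that can be pooled for classification or similarity search.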

Reviewed by Fredoline Eruo. See our editorial policy.
