RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Large language models

Distillation

Distillation is a training technique in which a smaller "student" model learns to mimic the behavior of a larger "teacher" model. The student is trained on the teacher's output probabilities (soft labels) rather than only on hard ground-truth labels, so it also learns how the teacher spreads probability across plausible alternatives. In local AI, distillation matters because it produces smaller models that run faster and use less VRAM while retaining much of the teacher's accuracy. For example, a well-distilled 7B model can approach the performance of a 13B teacher on many tasks, which makes it feasible on consumer GPUs.

Deeper dive

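Knowledge distillation in its modern form was popularized by Hinton et al. (2015). The student is trained on a weighted combination of two targets: the teacher's probability distribution, softened by a temperature parameter, and the original hard labels. A higher temperature flattens the teacher's distribution, so the student sees more of the relative probabilities the teacher assigns to the non-top classes. Variants include black-box distillation (only the teacher's outputs are available) and white-box distillation (the student can also match the teacher's intermediate layers). In practice, distillation is often combined with quantization to shrink the model further. DistilBERT, distilled from BERT, is the classic example. TinyLlama and Phi-3-mini are often listed alongside it, but TinyLlama was pretrained from scratch and Phi-3-mini relies on heavily filtered and synthetic training data rather than classic teacher-student distillation. For operators, compact models of either kind offer a practical trade-off: they run on lower-end hardware (e.g., 4 GB VRAM), and the generative ones deliver usable quality for tasks like summarization or code generation.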
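
The combined objective described above can be sketched in plain Python. This is a minimal illustration for a single example, not a training loop: the function names, the default temperature of 2.0, and the mixing weight alpha are assumptions chosen for the sketch, and the T² scaling on the soft term follows the convention from Hinton et al. (2015).

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-label term and a hard-label term (illustrative defaults)."""
    # Soft term: match the teacher's temperature-softened distribution.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Hard term: ordinary cross-entropy against the ground-truth class index.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term

if __name__ == "__main__":
    teacher = [4.0, 2.5, 0.5]   # made-up logits for a 3-class problem
    student = [2.0, 2.2, 0.1]
    print(distillation_loss(student, teacher, label=0))
```

Setting alpha to 1.0 trains on the teacher alone, while alpha of 0.0 reduces to ordinary supervised training. Real frameworks compute the same quantities over batches of tensors.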

Practical example

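A hypothetical 7B model distilled from Llama 3.1 70B might fit in about 6 GB of VRAM at Q4, whereas the teacher needs roughly 40 GB at the same quantization. On an RTX 3060 (12 GB), the distilled model fits entirely in VRAM and might run at around 30 tok/s, while the teacher cannot load without offloading most of its layers to system RAM, which can drop throughput to around 2 tok/s. These figures are illustrative; actual numbers depend on the quantization format, context length, and runtime.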
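
The VRAM figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below is a rough estimator under stated assumptions: it counts weights only, and the overhead_gb allowance for KV cache and runtime buffers is a hypothetical placeholder, not a measured value. Real 4-bit formats also store scales and other metadata, so their effective bits per weight sit somewhat above 4, which is why practical sizes come out larger than this lower bound.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

def fits_in_vram(params_billions, bits_per_weight, vram_gb, overhead_gb=1.5):
    """Rough check: weights plus an assumed overhead allowance must fit in VRAM."""
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb

if __name__ == "__main__":
    for name, size in [("7B student", 7), ("70B teacher", 70)]:
        gb = weight_memory_gb(size, 4)
        print(f"{name}: ~{gb:.1f} GB of weights at 4 bits, "
              f"fits in 12 GB: {fits_in_vram(size, 4, 12)}")
```

At exactly 4 bits per weight this gives 3.5 GB for the 7B student and 35 GB for the 70B teacher, before overhead.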

Workflow example

With Hugging Face Transformers, you can load a distilled model such as distilbert-base-uncased with AutoModel.from_pretrained('distilbert-base-uncased'). In Ollama, you can pull a compact model such as phi3:mini with ollama pull phi3:mini and run it locally with ollama run phi3:mini, where it uses far less VRAM than a full-size model. Note that Phi-3-mini is a small model trained on filtered and synthetic data rather than a distilled model in the strict teacher-student sense. The trade-off with compact models is typically lower accuracy on complex reasoning tasks.
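
The Transformers step can be sketched as follows. This assumes the third-party transformers and torch packages are installed and that the model can be downloaded from the Hugging Face Hub on first use; the helper name embed_text and the sample sentence are hypothetical and exist only for this illustration.

```python
MODEL_ID = "distilbert-base-uncased"

def embed_text(text, model_id=MODEL_ID):
    """Load a distilled encoder and return its final hidden states for one string."""
    # Imported lazily so the module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Shape: (batch, sequence_length, hidden_size)
    return outputs.last_hidden_state

if __name__ == "__main__":
    hidden = embed_text("Distilled models trade a little accuracy for speed.")
    print(hidden.shape)
```

DistilBERT is an encoder, so it produces embeddings rather than generated text. For local text generation, the Ollama route described above is the closer fit.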

Reviewed by Fredoline Eruo. See our editorial policy.
