RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Model Compression

08. Distillation Loss Functions

Chapter 8 of 18 · 15 min
KEY INSIGHT

Different distillation loss formulations emphasize different aspects of teacher knowledge, and hybrid formulations typically outperform any single approach.

Beyond soft and hard targets, several specialized loss functions extract specific kinds of knowledge from the teacher. Response-based distillation matches final outputs. Feature-based distillation matches intermediate representations. Relation-based distillation matches relationships between representations.

Feature-based distillation connects intermediate layers. Hidden states in neural networks encode hierarchical features: early layers capture low-level patterns, while later layers encode high-level abstractions. Teaching the student to produce similar intermediate representations transfers structured knowledge about feature hierarchies. The loss below is one minus the cosine similarity between student and teacher hidden states, with a linear projection when their widths differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDistillationLoss(nn.Module):
    """
    Match intermediate feature representations between teacher and student.
    """

    def __init__(self, student_dim=None, teacher_dim=None, temperature=2.0):
        super().__init__()
        self.temperature = temperature
        # Projection layer if dimensions differ. It is built here rather than
        # lazily inside forward() so that its weights are registered before
        # the optimizer is created and therefore actually get trained.
        self.projection = None
        if student_dim is not None and teacher_dim is not None \
                and student_dim != teacher_dim:
            self.projection = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden, attention_mask=None):
        """
        Args:
            student_hidden: Student's hidden states [batch, seq, student_dim]
            teacher_hidden: Teacher's hidden states [batch, seq, teacher_dim]
            attention_mask: Optional [batch, seq] mask, 1 for real tokens
        """
        if self.projection is not None:
            student_hidden = self.projection(student_hidden)

        # Cosine similarity between representations.
        # The teacher is a fixed target, so no gradient flows into it.
        student_norm = F.normalize(student_hidden, p=2, dim=-1)
        teacher_norm = F.normalize(teacher_hidden.detach(), p=2, dim=-1)
        cosine_sim = (student_norm * teacher_norm).sum(dim=-1)
        per_token = 1 - cosine_sim

        if attention_mask is not None:
            # Average over real tokens only, ignoring padding positions
            mask = attention_mask.to(per_token.dtype)
            feature_loss = (per_token * mask).sum() / mask.sum().clamp(min=1)
        else:
            feature_loss = per_token.mean()

        # The temperature is not applied to the features themselves, so
        # T**2 acts only as a constant scale on this term.
        return feature_loss * (self.temperature ** 2)
```

Relation-based distillation captures cross-layer relationships.
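Instead of matching individual representations, this approach matches the relationships between them. Two representations that are similar for the teacher should remain similar for the student. Gram matrices, which hold the pairwise inner products of a set of representations, capture these relationships efficiently.

A hybrid loss combines multiple distillation objectives:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridDistillationLoss(nn.Module):
    """
    Combines multiple distillation objectives.
    """

    def __init__(self, student, teacher, label_weight=0.3, response_weight=0.3,
                 feature_weight=0.2, relation_weight=0.2):
        super().__init__()
        self.student = student
        # The teacher is a fixed reference and is never updated
        self.teacher = teacher.eval().requires_grad_(False)
        self.label_weight = label_weight
        self.response_weight = response_weight
        self.feature_weight = feature_weight
        self.relation_weight = relation_weight

        # DistillationLoss and RelationDistillationLoss are not defined in
        # this chapter; they are assumed to be available in scope.
        self.response_loss = DistillationLoss()
        # Without a projection, student and teacher widths must match
        self.feature_loss = FeatureDistillationLoss()
        self.relation_loss = RelationDistillationLoss()

    def compute_label_loss(self, logits, labels):
        # Hard-label cross-entropy, flattened over all positions
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )

    def forward(self, batch):
        # Both models are assumed to return an object exposing .logits and
        # .hidden_states (one tensor per layer).
        student = self.student(batch)
        with torch.no_grad():
            teacher = self.teacher(batch)

        total_loss = (
            self.label_weight * self.compute_label_loss(student.logits, batch.labels)
            + self.response_weight * self.response_loss(
                student.logits, teacher.logits
            )
            # zip pairs layers by position and stops at the shallower model;
            # a student with fewer layers needs an explicit layer mapping.
            + self.feature_weight * sum(
                self.feature_loss(s, t)
                for s, t in zip(student.hidden_states, teacher.hidden_states)
            )
            + self.relation_weight * self.relation_loss(
                student.hidden_states, teacher.hidden_states
            )
        )
        return total_loss
```

Weight selection for the loss components requires empirical tuning. Too much emphasis on soft targets risks mimicking the teacher's errors. Too much emphasis on hard labels wastes the teacher's generalization signal. Adaptive weighting schemes adjust the loss coefficients during training based on validation performance.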
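
The hybrid loss refers to a `RelationDistillationLoss`, but the chapter does not define one. The following is a minimal sketch of a Gram-matrix version, not the course's own implementation. It assumes the relationships of interest are token-to-token cosine similarities within each sequence, and that when a tuple of per-layer hidden states is passed, only the last layer is compared. Both choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationDistillationLoss(nn.Module):
    """Match token-to-token similarity (Gram) matrices of teacher and student."""

    @staticmethod
    def gram(hidden):
        # [batch, seq, hidden] -> [batch, seq, seq] cosine-similarity matrix
        normed = F.normalize(hidden, p=2, dim=-1)
        return normed @ normed.transpose(-1, -2)

    def forward(self, student_hidden, teacher_hidden):
        # Assumption: when given per-layer tuples, compare only the last layer
        if isinstance(student_hidden, (tuple, list)):
            student_hidden = student_hidden[-1]
        if isinstance(teacher_hidden, (tuple, list)):
            teacher_hidden = teacher_hidden[-1]

        student_gram = self.gram(student_hidden)
        # The teacher is a fixed target, so no gradient flows into it
        teacher_gram = self.gram(teacher_hidden.detach())
        return F.mse_loss(student_gram, teacher_gram)


# Usage: the hidden sizes differ (64 vs 128), but both Gram matrices are
# [batch, seq, seq], so no projection layer is needed.
torch.manual_seed(0)
student_h = torch.randn(2, 5, 64, requires_grad=True)
teacher_h = torch.randn(2, 5, 128)
loss = RelationDistillationLoss()(student_h, teacher_h)
loss.backward()
```

Because the Gram matrices depend only on batch size and sequence length, this formulation sidesteps the dimension mismatch that the feature loss handles with a projection. Its cost grows with the square of the sequence length, which may matter for long inputs.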

EXERCISE

Implement a complete hybrid distillation loss combining response, feature, and relation distillation. Compare each component's contribution to final student performance by ablating one component at a time.
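
One way to organize the ablation is to generate one weight configuration per removed component and train a student under each. The sketch below is a starting point only. The helper name `ablation_configs` is hypothetical, the base weights are the defaults from the hybrid loss in this chapter, and renormalizing the remaining weights to sum to 1 is an assumption, made so that the overall loss scale stays comparable across runs.

```python
def ablation_configs(base_weights, components, renormalize=True):
    """Return (name, weights) pairs: the full config, then one per removed component."""
    configs = [("full", dict(base_weights))]
    for removed in components:
        # Zero out exactly one component and keep the others
        weights = {
            key: (0.0 if key == removed else value)
            for key, value in base_weights.items()
        }
        if renormalize:
            total = sum(weights.values())
            if total > 0:
                weights = {key: value / total for key, value in weights.items()}
        name = "no_" + removed.replace("_weight", "")
        configs.append((name, weights))
    return configs


base = {
    "label_weight": 0.3,
    "response_weight": 0.3,
    "feature_weight": 0.2,
    "relation_weight": 0.2,
}
# Ablate only the three distillation terms; the hard-label term stays in
configs = ablation_configs(
    base, components=["response_weight", "feature_weight", "relation_weight"]
)
for name, weights in configs:
    print(name, weights)
```

Each weight dictionary can be passed as keyword arguments to the hybrid loss constructor. Comparing every ablated run against the `full` run on the same validation set, with the same seed and training budget, gives a rough estimate of what each component contributes.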

← Chapter 7: Teacher-Student Setup
Chapter 9: Distillation at Scale →