08. Distillation Loss Functions

Chapter 8 of 18 · 15 min

KEY INSIGHT

Different distillation loss formulations emphasize different aspects of teacher knowledge, and hybrid formulations typically outperform any single approach.

Beyond soft and hard targets, several specialized loss functions extract specific kinds of knowledge from the teacher. Response-based distillation matches final outputs. Feature-based distillation matches intermediate representations. Relation-based distillation matches relationships between representations.

Feature-based distillation connects intermediate layers.

Hidden states in neural networks encode hierarchical features: early layers capture low-level patterns, while later layers encode high-level abstractions. Teaching the student to produce similar intermediate representations transfers structured knowledge about these feature hierarchies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDistillationLoss(nn.Module):
    """
    Match intermediate feature representations between teacher and student.
    """

    def __init__(self, hidden_size_match=True, temperature=2.0,
                 student_dim=None, teacher_dim=None):
        super().__init__()
        self.temperature = temperature
        self.hidden_match = hidden_size_match
        # Projection layer if dimensions differ. Passing both dims builds it
        # here, so its parameters exist before the optimizer is created.
        self.projection = None
        if student_dim is not None and teacher_dim is not None \
                and student_dim != teacher_dim:
            self.projection = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden, attention_mask=None):
        """
        Args:
            student_hidden: Student's hidden states [batch, seq, student_dim]
            teacher_hidden: Teacher's hidden states [batch, seq, teacher_dim]
            attention_mask: Optional [batch, seq] mask, 1 for real tokens
        """
        if student_hidden.shape[-1] != teacher_hidden.shape[-1]:
            if self.projection is None:
                # Lazy fallback: a layer created here is missing from any
                # optimizer that was built before the first forward pass.
                self.projection = nn.Linear(
                    student_hidden.shape[-1], teacher_hidden.shape[-1]
                ).to(student_hidden.device)
            student_hidden = self.projection(student_hidden)

        # Cosine similarity between representations, per token: [batch, seq]
        student_norm = F.normalize(student_hidden, p=2, dim=-1)
        teacher_norm = F.normalize(teacher_hidden, p=2, dim=-1)
        cosine_sim = (student_norm * teacher_norm).sum(dim=-1)

        per_token = 1 - cosine_sim
        if attention_mask is not None:
            # Average over real tokens only, ignoring padding
            mask = attention_mask.to(per_token.dtype)
            feature_loss = (per_token * mask).sum() / mask.sum().clamp(min=1)
        else:
            feature_loss = per_token.mean()

        # Here the temperature acts only as a constant scale factor
        return feature_loss * (self.temperature ** 2)
```

Relation-based distillation captures cross-layer relationships.
Instead of matching individual representations, this approach matches the relationships between them: two representations that are similar for the teacher should remain similar for the student. Gram matrices capture these pairwise relationships efficiently.

A hybrid loss combines multiple distillation objectives:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridDistillationLoss(nn.Module):
    """
    Combines multiple distillation objectives.
    """

    def __init__(self, student, teacher, label_weight=0.3, response_weight=0.3,
                 feature_weight=0.2, relation_weight=0.2):
        super().__init__()
        # Both models must return `.logits` and `.hidden_states`
        self.student = student
        self.teacher = teacher
        self.label_weight = label_weight
        self.response_weight = response_weight
        self.feature_weight = feature_weight
        self.relation_weight = relation_weight
        # DistillationLoss and RelationDistillationLoss are assumed to be
        # defined elsewhere; FeatureDistillationLoss is defined above.
        self.response_loss = DistillationLoss()
        self.feature_loss = FeatureDistillationLoss()
        self.relation_loss = RelationDistillationLoss()

    def compute_label_loss(self, logits, labels):
        # Hard-label cross-entropy over all positions
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )

    def forward(self, batch):
        student = self.student(batch)
        with torch.no_grad():  # the teacher is not trained
            teacher = self.teacher(batch)

        total_loss = (
            self.label_weight * self.compute_label_loss(student.logits, batch.labels)
            + self.response_weight * self.response_loss(
                student.logits, teacher.logits
            )
            # zip pairs layers in order and stops at the shorter list, so a
            # shallower student needs an explicit layer mapping instead.
            + self.feature_weight * sum(
                self.feature_loss(s, t)
                for s, t in zip(student.hidden_states, teacher.hidden_states)
            )
            + self.relation_weight * self.relation_loss(
                student.hidden_states, teacher.hidden_states
            )
        )
        return total_loss
```

Weight selection for the loss components requires empirical tuning. Too much emphasis on soft targets risks mimicking the teacher's errors. Too much emphasis on hard labels wastes the teacher's generalization signal. Adaptive weighting schemes adjust the loss coefficients during training based on validation performance.
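The hybrid loss refers to a `RelationDistillationLoss` that this chapter does not spell out. The following is a minimal sketch of one way to write it, assuming the Gram matrix is taken over token positions with cosine normalization and that teacher and student matrices are compared with mean squared error. Both choices are assumptions made for illustration, not the chapter's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationDistillationLoss(nn.Module):
    """Hypothetical sketch: match token-to-token similarity (Gram) matrices."""

    @staticmethod
    def gram(hidden):
        # hidden: [batch, seq, dim] -> [batch, seq, seq] cosine similarities
        hidden = F.normalize(hidden, p=2, dim=-1)
        return hidden @ hidden.transpose(-1, -2)

    def forward(self, student_hidden, teacher_hidden):
        # Accept one tensor or a sequence of per-layer tensors
        if torch.is_tensor(student_hidden):
            student_hidden = [student_hidden]
            teacher_hidden = [teacher_hidden]
        losses = [
            F.mse_loss(self.gram(s), self.gram(t))
            for s, t in zip(student_hidden, teacher_hidden)
        ]
        # Average the per-layer losses into a single scalar
        return torch.stack(losses).mean()


# Usage: hidden sizes may differ, because each Gram matrix is [seq, seq]
teacher_h = torch.randn(2, 5, 16)
student_h = torch.randn(2, 5, 8)
loss = RelationDistillationLoss()(student_h, teacher_h)
```

Because the Gram matrix has shape `[batch, seq, seq]` regardless of hidden size, this formulation needs no projection layer when the student is narrower than the teacher. Teacher and student only have to agree on batch size and sequence length.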

EXERCISE

Implement a complete hybrid distillation loss combining response, feature, and relation distillation. Compare each component's contribution to final student performance by ablating one component at a time.
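As a starting point for the ablation, the sketch below builds one weight configuration per run by zeroing a single component. The helper name `ablation_configs`, the component keys, and the decision to rescale the remaining weights so that they keep the original total are all assumptions made for illustration. Leaving the remaining weights untouched is an equally valid design.

```python
# Default weights taken from HybridDistillationLoss
BASE_WEIGHTS = {"label": 0.3, "response": 0.3, "feature": 0.2, "relation": 0.2}


def ablation_configs(base, components=("response", "feature", "relation")):
    """Return {run_name: weights}, with one component zeroed per run."""
    configs = {"full": dict(base)}
    total = sum(base.values())
    for dropped in components:
        kept = {k: v for k, v in base.items() if k != dropped}
        # Rescale the remaining weights so they still sum to the original total
        scale = total / sum(kept.values())
        weights = {k: v * scale for k, v in kept.items()}
        weights[dropped] = 0.0
        configs[f"no_{dropped}"] = weights
    return configs


configs = ablation_configs(BASE_WEIGHTS)
for name, weights in configs.items():
    print(name, weights)
```

Each configuration can be passed to the hybrid loss as keyword arguments, for example `label_weight=weights["label"]`. Training one student per configuration under the same seed and data order keeps the comparison fair.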