07. Teacher-Student Setup

Chapter 7 of 18 · 15 min

KEY INSIGHT

Successful knowledge distillation requires careful architecture selection for the student model, balancing capacity constraints against deployment requirements.

The teacher is typically a pre-trained model that already performs well on the target task; large, high-accuracy models make effective teachers. The student must be compact enough to meet deployment constraints while retaining enough representational capacity to learn the task.

Choosing a student architecture differs from standard model design because the student is usually derived from, or sized relative to, the teacher. Pruning-based distillers derive student architectures by removing structural components from the teacher. Low-rank factorizations create students from factorized teacher layers. Manual architectures provide explicit control over student parameters.

Pruning-based student extraction offers a principled approach. After magnitude pruning the teacher to a target sparsity, the remaining weights define the student, which is then trained with distillation from the original, unpruned teacher. Because the student starts from the teacher's surviving weights, it begins as a sparse approximation of the teacher rather than from a random initialization.

```python
import torch


def extract_student_by_pruning(teacher_model, target_sparsity=0.5):
    """
    Extract a student by magnitude-pruning a copy of the teacher.

    Assumes the model class can be rebuilt from `teacher_model.config`.
    Weights are zeroed in place (unstructured sparsity), so tensor shapes
    stay the same; the mask is not stored and must be re-applied during
    training if the sparsity should be preserved.
    """
    student = type(teacher_model)(teacher_model.config)

    # Teacher and student share a class, so their parameters line up by position.
    with torch.no_grad():
        for (t_name, t_param), (_, s_param) in zip(
            teacher_model.named_parameters(), student.named_parameters()
        ):
            if "weight" in t_name:
                # kthvalue is used because torch.quantile rejects very large
                # tensors (for example, embedding matrices).
                magnitudes = t_param.abs().flatten().float()
                k = max(1, int(target_sparsity * magnitudes.numel()))
                threshold = torch.kthvalue(magnitudes, k).values
                mask = t_param.abs() > threshold
                s_param.copy_(t_param * mask)
            else:
                s_param.copy_(t_param)
    return student


def manual_student_architecture(config):
    """
    Create a student architecture manually from the teacher's config.
    Smaller hidden dimensions, fewer layers.
    """
    # StudentModel is a placeholder for a model class defined elsewhere.
    return StudentModel(
        embed_dim=config.embed_dim // 2,    # Half the hidden width
        num_layers=config.num_layers // 2,  # Half the depth
        num_heads=config.num_heads // 2,    # Half the attention heads
        ff_dim=config.ff_dim // 2,          # Half the feed-forward width
        vocab_size=config.vocab_size,       # Vocabulary is unchanged
    )
```

A critical decision point involves intermediate layer matching. Standard distillation matches only the final outputs, but intermediate representations also carry information. Intermediate layer distillation adds losses that compare the student's hidden states against the corresponding teacher hidden states, giving the student a training signal at intermediate layers as well as at the output.

The capacity gap between teacher and student creates a fundamental tension. A student too similar to the teacher offers minimal compression; a student that is too small cannot learn the teacher's behavior. The optimal student architecture has the minimum capacity needed to capture essential task performance.

A failure mode emerges when the student learns to mimic the teacher without learning the underlying task. This can occur when the hard-label loss receives insufficient weight, or when the teacher's soft targets contain spurious correlations that the student adopts. Regularization and monitoring of validation performance help prevent this degenerate solution.

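The intermediate layer matching and loss weighting discussed above can be sketched as a single loss module. This is a minimal illustration, not a prescribed recipe: the class name `DistillationLoss`, the weights `alpha` and `beta`, the `temperature` value, and the linear projection `proj` are all assumptions introduced here, and the temperature-softened KL term is one common way to compare student and teacher outputs. The defaults are placeholders rather than recommended settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistillationLoss(nn.Module):
    """Hard-label loss + soft-target loss + hidden-state matching (illustrative)."""

    def __init__(self, student_dim, teacher_dim, temperature=2.0,
                 alpha=0.5, beta=0.1):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha  # weight on the hard-label term (assumed value)
        self.beta = beta    # weight on the hidden-state term (assumed value)
        # Learned map so narrower student states can be compared
        # with wider teacher states.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden):
        T = self.temperature
        # Hard-label term: keeps the student tied to the underlying task.
        hard = F.cross_entropy(student_logits, labels)
        # Soft-target term: KL divergence between softened distributions,
        # scaled by T^2 so its gradients stay comparable across temperatures.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits.detach() / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Intermediate term: match projected student states to teacher states.
        hidden = F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())
        return self.alpha * hard + (1.0 - self.alpha) * soft + self.beta * hidden


# Usage with random tensors standing in for real model outputs.
torch.manual_seed(0)
loss_fn = DistillationLoss(student_dim=32, teacher_dim=64)
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
student_hidden = torch.randn(4, 32, requires_grad=True)
teacher_hidden = torch.randn(4, 64)

loss = loss_fn(student_logits, teacher_logits, labels,
               student_hidden, teacher_hidden)
loss.backward()
```

Raising `alpha` puts more weight on the hard labels, which is one lever against a student that mimics the teacher without learning the task. Setting `beta` to zero recovers output-only distillation.
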
EXERCISE

Design three student architectures representing different compression levels of a BERT model. Implement parameter counting and report each student's parameter count as a fraction of the teacher's. Identify which architecture you expect to achieve the best accuracy-efficiency tradeoff, and explain why.