RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Model Compression

07. Teacher-Student Setup

Chapter 7 of 18 · 15 min
KEY INSIGHT

Successful knowledge distillation requires careful architecture selection for the student model, balancing capacity constraints against deployment requirements.

The teacher is typically a pre-trained model that already performs well on the target task; large, high-accuracy models make effective teachers. The student must be compact enough to meet deployment constraints while retaining enough representational capacity to learn the task.

Architecture selection for students differs from standard model design. Pruning-based distillers derive the student architecture by removing structural components from the teacher. Low-rank factorization builds the student from factorized teacher layers. Manual architectures give explicit control over the student's parameters.

Pruning-based student extraction offers a principled approach. After magnitude pruning the teacher to a target sparsity, the surviving weights define the student, which is then trained with distillation from the original, unpruned teacher. Because the student starts from a subset of the teacher's weights, it begins as a compressed copy of the teacher rather than from a random initialization.

```python
import torch


def extract_student_by_pruning(teacher_model, target_sparsity=0.5):
    """
    Extract a student by magnitude-pruning a copy of the teacher.

    Assumes the model class can be rebuilt from `teacher_model.config`.
    Returns a student with the teacher's tensor shapes and the
    smallest-magnitude weights zeroed. Actually shrinking the tensors
    requires structured pruning on top of this.
    """
    student = type(teacher_model)(teacher_model.config)

    with torch.no_grad():
        for (t_name, t_param), (s_name, s_param) in zip(
            teacher_model.named_parameters(), student.named_parameters()
        ):
            flat = t_param.abs().flatten()
            k = int(target_sparsity * flat.numel())
            # Prune weight matrices only; biases and norm weights are copied.
            if "weight" in t_name and t_param.dim() > 1 and k > 0:
                # kthvalue avoids the input-size limit of torch.quantile.
                threshold = torch.kthvalue(flat, k).values
                mask = t_param.abs() > threshold
                s_param.copy_(t_param * mask)
            else:
                # copy_ avoids sharing storage with the teacher.
                s_param.copy_(t_param)
    return student


def manual_student_architecture(config):
    """
    Create a student architecture manually: narrower and shallower.

    `StudentModel` is a placeholder for your own model class.
    Halving both width and depth shrinks the transformer blocks
    roughly 8x, not 2x.
    """
    return StudentModel(
        embed_dim=config.embed_dim // 2,    # half the hidden width
        num_layers=config.num_layers // 2,  # half the depth
        num_heads=config.num_heads // 2,    # half the heads, same head size
        ff_dim=config.ff_dim // 2,          # half the feed-forward width
        vocab_size=config.vocab_size,
    )
```

A critical decision point involves intermediate layer matching. Standard distillation matches only the final outputs, but intermediate representations also carry information. Intermediate layer distillation adds losses that compare the student's hidden states against the corresponding teacher hidden states, providing more gradient signal during training.

The capacity gap between teacher and student creates a fundamental tension. A student too similar to the teacher offers minimal compression; a student that is too small cannot learn the teacher's behavior. The optimal student is the smallest architecture that still captures essential task performance.

A failure mode emerges when the student learns to mimic the teacher without learning the underlying task. This can occur when the hard-label loss receives insufficient weight or when the teacher's soft targets contain spurious correlations that the student adopts. Regularization and validation monitoring help prevent this degenerate solution.

EXERCISE

Design three student architectures representing different compression levels of a BERT model. Implement parameter counting and measure the capacity of each student relative to the teacher. Identify which architecture might achieve the best accuracy-efficiency tradeoff.
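As a possible starting point for the parameter-counting part, the sketch below counts parameters analytically for a BERT-style encoder. The function name `bert_param_count` and the student configuration are hypothetical; the defaults (vocabulary of 30522, 512 positions, 2 token types, pooler included) follow the standard BERT-base layout.

```python
def bert_param_count(hidden, layers, ff_dim, vocab=30522,
                     max_pos=512, type_vocab=2, pooler=True):
    """Analytical parameter count for a BERT-style encoder."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and token-type embeddings plus one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V, and output projections (with biases) plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (with biases) plus one LayerNorm.
    feed_forward = (hidden * ff_dim + ff_dim) + (ff_dim * hidden + hidden) + layer_norm
    pool = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * (attention + feed_forward) + pool


teacher = bert_param_count(hidden=768, layers=12, ff_dim=3072)
print(teacher)  # 109482240, the BERT-base encoder size

# Hypothetical student, for illustration only; design your own three.
student = bert_param_count(hidden=384, layers=6, ff_dim=1536)
print(f"student/teacher parameter ratio: {student / teacher:.3f}")
```

The number of attention heads does not appear because it does not change the parameter count at a fixed hidden size. The embedding table shrinks only linearly with width, so it takes up a growing share of small students; that is worth keeping in mind when comparing compression levels.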
