Knowledge Distillation
Transfer knowledge from a large teacher model to a smaller student model using temperature scaling.
Concept Overview
Knowledge Distillation is a model compression technique in which a small "student" model is trained to reproduce the behavior of a larger, pre-trained "teacher" model. Instead of solely learning from the hard ground-truth labels, the student learns from the "soft targets" produced by the teacher. These soft targets contain "dark knowledge"—the relative probabilities of incorrect classes—which reveals the teacher's understanding of similarities between different classes, leading to faster convergence and better generalization for the student.
Mathematical Definition
In standard classification, a neural network outputs logits z_i, which the softmax function converts into probabilities q_i. Knowledge distillation introduces a temperature parameter T to soften these probabilities:

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

Setting T = 1 recovers the standard softmax; larger values of T produce a softer, more uniform distribution over classes.
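The softened softmax above can be sketched in a few lines of NumPy (the function name and inputs are illustrative, not from any particular library):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Compute q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()        # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()
```

With T = 1 this is the ordinary softmax; raising T shrinks the gaps between scaled logits, so the resulting distribution spreads probability mass more evenly across classes.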
The total loss L for the student model is a weighted sum of two components: the standard cross-entropy loss against the hard ground-truth labels (computed at T = 1) and the distillation loss, the Kullback-Leibler divergence between the teacher's and student's soft targets (both at temperature T):

L = α · CE(y, softmax(z_s)) + (1 − α) · T² · KL( softmax(z_t / T) ‖ softmax(z_s / T) )

The distillation term is multiplied by T² because the gradients of the softened loss scale as 1/T², so this factor keeps the relative magnitudes of the two terms roughly constant as T is varied.
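A minimal NumPy sketch of this combined loss for a single example follows; the weighting factor alpha, the integer hard_label, and all function names are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def _softmax(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and T^2-scaled KL distillation loss."""
    p_teacher = _softmax(teacher_logits, T)          # teacher soft targets
    p_student = _softmax(student_logits, T)          # student soft predictions
    q_student = _softmax(student_logits, 1.0)        # student at T = 1 for hard labels

    # KL(p_teacher || p_student), scaled by T^2 to balance gradient magnitudes
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    ce = -np.log(q_student[hard_label])              # cross-entropy with the hard label
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the weighted hard-label cross-entropy remains.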
Key Concepts
- Soft Targets: The softened probability distribution output by the teacher model. They provide richer information than hard labels (e.g., indicating that a picture of a dog is more similar to a cat than to a car).
- Dark Knowledge: The hidden structural relationships between classes learned by the teacher model, embedded in the small probabilities assigned to incorrect classes.
- Temperature Scaling: Dividing logits by T before applying the softmax. As T increases, the distribution approaches uniform, exposing the relative probabilities of low-logit classes that a standard softmax would squash toward zero.
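To build intuition for the temperature effect described above, the short script below softens one hypothetical logit vector at several temperatures (the logit values are made up for illustration):

```python
import numpy as np

def soften(logits, T):
    """Temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                  # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([5.0, 2.0, 0.5])   # hypothetical logits, e.g. dog, cat, car

dists = {T: soften(logits, T) for T in (1.0, 4.0, 10.0)}
for T, p in dists.items():
    print(f"T={T:>4}: {np.round(p, 3)}")
```

Printing the three distributions shows the same ranking at every temperature, but the top class loses probability mass to the others as T grows; the "dog is more like a cat than a car" structure becomes visible.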
Historical Context
The idea of model compression by transferring knowledge from a large model to a smaller one was first introduced by Rich Caruana and his collaborators in 2006. The concept was later generalized and popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their seminal 2015 paper "Distilling the Knowledge in a Neural Network," which formally introduced the temperature scaling mechanism.
Real-world Applications
- Edge Computing: Deploying complex machine learning models on resource-constrained devices like smartphones and IoT sensors.
- Large Language Models: Distilling massive models into smaller, faster variants that retain most of the original performance, e.g., DistilBERT, which was distilled from BERT.
- Ensemble Distillation: Training a single student model to replicate the aggregated predictions of an ensemble of teacher models.
Related Concepts
- Transfer Learning — adapting a pre-trained model to a new task.
- Cross-Entropy Loss — the fundamental metric for classification training.
- Model Pruning — another technique for reducing model size by removing weights.
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Knowledge Distillation module.
Try Knowledge Distillation on Riano →