Knowledge Distillation
Transfer knowledge from a large teacher model to a smaller student model using temperature scaling.
Concept Overview
Knowledge Distillation is a model compression technique in which a small "student" model is trained to reproduce the behavior of a larger, pre-trained "teacher" model. Instead of solely learning from the hard ground-truth labels, the student learns from the "soft targets" produced by the teacher. These soft targets contain "dark knowledge"—the relative probabilities of incorrect classes—which reveals the teacher's understanding of similarities between different classes, leading to faster convergence and better generalization for the student.
Mathematical Definition
In standard classification, a neural network outputs logits z_i, which the softmax function converts into probabilities q_i. Knowledge distillation introduces a temperature parameter T to soften these probabilities:

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

Setting T = 1 recovers the standard softmax; larger values of T produce a softer, more uniform distribution over classes.
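The softened softmax above can be sketched in a few lines of NumPy (the function name and inputs are illustrative, not from any particular library):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Compute q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()        # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()
```

With T = 1 this is the ordinary softmax; raising T shrinks the gaps between scaled logits, so the resulting distribution spreads probability mass more evenly across classes.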
The total loss L for the student model is a weighted sum of two components: the standard cross-entropy loss against the hard ground-truth labels (computed at T = 1) and the distillation loss, the Kullback-Leibler divergence between the teacher's and student's soft targets (both at temperature T):

L = α · CE(y, softmax(z_s)) + (1 − α) · T² · KL( softmax(z_t / T) ‖ softmax(z_s / T) )

The distillation term is multiplied by T² because the gradients of the softened loss scale as 1/T², so this factor keeps the relative magnitudes of the two terms roughly constant as T is varied.
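A minimal NumPy sketch of this combined loss for a single example follows; the weighting factor alpha, the integer hard_label, and all function names are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def _softmax(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and T^2-scaled KL distillation loss."""
    p_teacher = _softmax(teacher_logits, T)          # teacher soft targets
    p_student = _softmax(student_logits, T)          # student soft predictions
    q_student = _softmax(student_logits, 1.0)        # student at T = 1 for hard labels

    # KL(p_teacher || p_student), scaled by T^2 to balance gradient magnitudes
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    ce = -np.log(q_student[hard_label])              # cross-entropy with the hard label
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the weighted hard-label cross-entropy remains.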
Key Concepts
- Soft Targets: The softened probability distribution output by the teacher model. They provide richer information than hard labels (e.g., indicating that a picture of a dog is more similar to a cat than to a car).
- Dark Knowledge: The hidden structural relationships between classes learned by the teacher model, embedded in the small probabilities assigned to incorrect classes.
- Temperature Scaling: Dividing logits by T before applying the softmax. As T increases, the distribution approaches uniform, exposing the relative probabilities of low-logit classes that a standard softmax would squash toward zero.
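To build intuition for the temperature effect described above, the short script below softens one hypothetical logit vector at several temperatures (the logit values are made up for illustration):

```python
import numpy as np

def soften(logits, T):
    """Temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                  # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([5.0, 2.0, 0.5])   # hypothetical logits, e.g. dog, cat, car

dists = {T: soften(logits, T) for T in (1.0, 4.0, 10.0)}
for T, p in dists.items():
    print(f"T={T:>4}: {np.round(p, 3)}")
```

Printing the three distributions shows the same ranking at every temperature, but the top class loses probability mass to the others as T grows; the "dog is more like a cat than a car" structure becomes visible.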
Historical Context
The idea of model compression by transferring knowledge from a large model to a smaller one was first introduced by Rich Caruana and his collaborators in 2006. The concept was later generalized and popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their seminal 2015 paper "Distilling the Knowledge in a Neural Network," which formally introduced the temperature scaling mechanism.
Real-world Applications
- Edge Computing: Deploying complex machine learning models on resource-constrained devices like smartphones and IoT sensors.
- Large Language Models: Distilling massive models into smaller, faster variants that retain most of the original performance, e.g., DistilBERT, which was distilled from BERT.
- Ensemble Distillation: Training a single student model to replicate the aggregated predictions of an ensemble of teacher models.
Related Concepts
- Transfer Learning — adapting a pre-trained model to a new task.
- Cross-Entropy Loss — the fundamental metric for classification training.
- Model Pruning — another technique for reducing model size by removing weights.
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Knowledge Distillation module.
Try Knowledge Distillation on Riano →