How do you fit a trillion-parameter mind into a phone? The answer involves temperature, soft labels, and one of the most elegant ideas in modern machine learning — teaching a small model to think like a large one.
When GPT-4 predicts the next token, it doesn't just output the single most probable word. It outputs a probability distribution over the entire vocabulary — 100,000+ numbers, each one reflecting some learned relationship about language. A small model trained only on labels ("the correct answer is cat") misses all of that richness. Distillation is how you transfer it.
Geoffrey Hinton coined the term in 2015. The insight was deceptively simple: train the student on the teacher's soft probability outputs, not just the hard correct labels. Those soft probabilities encode the teacher's internal beliefs about similarity, ambiguity, and structure — far richer than a binary right/wrong signal.
Softening the teacher's output distribution with a temperature parameter T reveals probability mass placed on "wrong but similar" answers — the hidden knowledge.
Response-based: student learns from the teacher's final output logits (or the probabilities derived from them). The most common form. Works black-box: you only need teacher inference, not internals.
Feature-based: student mimics the teacher's intermediate hidden representations, not just the output. Richer signal, but requires white-box access to the teacher's architecture.
Relation-based: student learns relationships between examples, the structure of the teacher's embedding space. Useful for metric learning and cross-modal transfer.
Before Hinton, model compression was largely about pruning, quantization, and architectural redesign — mechanical shrinkage. The distillation paper proposed something different: the large model itself is the curriculum. Instead of discarding the teacher's knowledge, you use it as a richer training signal for the student.
The key observation: when a well-trained model predicts a cat image, it doesn't just say "cat: 99%". It says something like "cat: 99%, leopard: 0.7%, dog: 0.2%, tiger: 0.05%". Those tiny probabilities on "wrong" labels encode the model's understanding that cats are more similar to leopards than to cars. That structure is gold — and a hard label "cat: 1, everything else: 0" throws it all away.
The "dark knowledge" concept: Hinton called the information encoded in near-zero probabilities dark knowledge. A model trained on MNIST might assign 10⁻⁶ probability to "2" when the input is clearly a "3" — but that tiny value still encodes the fact that 2 and 3 are more similar than 2 and 7. Over millions of examples, this structure teaches the student far more than binary labels ever could.
The T² factor is subtle but important. When you divide logits by T before softmax, the gradient of the soft loss is scaled down by 1/T². Multiplying by T² cancels this out — ensuring the soft and hard losses contribute at comparable magnitudes regardless of temperature. Hinton specifically calls this out and it's often missed in implementations.
You could minimize MSE between teacher and student logits directly. But KL divergence treats the teacher's distribution as a probability distribution and measures how much information the student loses. It asymmetrically penalizes the student for assigning low probability where the teacher assigned high probability — which is exactly what you want. MSE treats all errors equally regardless of probability mass, making it a worse fit for this task.
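Here's a minimal PyTorch sketch of the combined objective, with the T² correction applied to the soft term (the function name and the α and T defaults are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style loss: alpha * soft KL term + (1 - alpha) * hard CE term.

    student_logits, teacher_logits: (batch, num_classes); labels: (batch,).
    """
    # Soften both distributions with the same temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)

    # KL(teacher || student), scaled by T^2 so its gradients stay comparable
    # to the hard loss at any temperature (the correction Hinton calls out).
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)

    # Standard cross-entropy against ground-truth labels, computed at T = 1.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```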
The softmax function converts raw logits into probabilities. Normally you compute exp(z_i) / Σ exp(z_j). Temperature scaling adds one parameter T by dividing each logit by T first, exp(z_i / T) / Σ exp(z_j / T), which controls how "peaked" or "flat" the resulting distribution is:
[Figure: a teacher's logit distribution and its softmax output at temperature T, swept from T=0.2 (near hard) to T=10 (near uniform).]
Notice how at high T, the model reveals its beliefs about "cat vs leopard" similarity. At T=1, that information is almost entirely hidden in near-zero probability mass.
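A few lines of numpy make this concrete; the logit values below are invented for illustration:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by T before exponentiating."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([9.0, 4.2, 3.1, 1.0])   # cat, leopard, dog, car (made up)
for T in (0.2, 1.0, 4.0, 10.0):
    print(f"T={T:>4}: {np.round(softmax_T(logits, T), 4)}")
# At T=0.2 essentially all mass sits on "cat"; by T=4-10 the
# cat-vs-leopard similarity is clearly visible in the distribution.
```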
In practice, the teacher and student both use the same temperature T during distillation training. At inference time, you set T = 1.0 for the deployed student model — the temperature was only a training tool to enrich the supervision signal.
Too low (T < 1): barely different from hard labels. The soft distribution is still very peaked around the argmax. Too high (T > 20): the distribution becomes so uniform it's nearly random noise — no useful signal. T = 3–5 is the standard starting point for most LLM distillation tasks. For tasks with high semantic overlap between classes (like token-level language modeling where many tokens are nearly equivalent), higher T (6–10) often works better.
The student is trained to match the teacher's output layer predictions — logits or probabilities. This is the original Hinton formulation and remains the most common form, especially for LLMs where "output" means the full vocabulary distribution at each position.
Instead of just matching the final output, the student is trained to match the teacher's intermediate hidden representations. Each transformer layer produces a hidden state tensor. Feature distillation adds auxiliary losses that penalize differences between corresponding hidden states in teacher and student.
This requires a white-box teacher — you need access to internal activations. For GPT-4 or Claude, this is impossible. For open models like Llama or Mistral, it's entirely feasible. The tradeoff: richer signal at the cost of requiring teacher internals and often needing projection layers (since teacher and student hidden dimensions differ).
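A sketch of what such an auxiliary loss can look like, with a learned projection to bridge the dimension gap; the hidden sizes (768 student, 4096 teacher) are hypothetical:

```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    """Match a student hidden state to a teacher hidden state via projection."""

    def __init__(self, d_student=768, d_teacher=4096):
        super().__init__()
        # The projection is needed because the hidden dimensions differ.
        self.proj = nn.Linear(d_student, d_teacher)
        self.mse = nn.MSELoss()

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: (batch, seq, d_student)
        # teacher_hidden: (batch, seq, d_teacher), detached: no teacher grads
        return self.mse(self.proj(student_hidden), teacher_hidden.detach())
```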
A variant of feature distillation: instead of matching hidden state activations, match the attention weight matrices across layers. The teacher's attention patterns encode which tokens it decided mattered for each prediction. Transferring these patterns teaches the student where to look, not just what to output. Used in TinyBERT and MiniLM; often more effective than hidden state matching for downstream task performance.
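The attention-transfer term can be sketched roughly like this; `layer_map`, the pairing between student and teacher layers, is an assumption you would tune to the two architectures:

```python
import torch.nn.functional as F

def attention_transfer_loss(student_attns, teacher_attns, layer_map):
    """TinyBERT-style attention matching, summed over mapped layer pairs.

    student_attns[i], teacher_attns[j]: (batch, heads, seq, seq) attention
    probabilities. Assumes matching head counts; real setups map or average
    heads when they differ.
    """
    return sum(
        F.mse_loss(student_attns[s], teacher_attns[t].detach())
        for s, t in layer_map
    )

# e.g. a 4-layer student supervised by layers 3, 6, 9, 12 of a 12-layer teacher:
# loss = attention_transfer_loss(s_attns, t_attns, [(0, 2), (1, 5), (2, 8), (3, 11)])
```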
Instead of matching individual outputs or activations, relation-based distillation matches relationships between examples. For a batch of N examples, you compute an N×N similarity matrix from both teacher and student embeddings and train the student to reproduce the teacher's similarity structure.
This is especially powerful for embedding models and retrieval tasks — it teaches the student to place similar concepts near each other in representation space, even if the absolute coordinates differ. Used extensively in contrastive learning and cross-modal distillation (e.g. distilling CLIP into smaller vision encoders).
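A sketch of a relation-based loss over one batch; note that the teacher and student embedding dimensions never need to match, since only the N×N similarity matrices are compared:

```python
import torch
import torch.nn.functional as F

def relation_distill_loss(student_emb, teacher_emb):
    """Match pairwise cosine-similarity structure across a batch.

    student_emb: (N, d_s), teacher_emb: (N, d_t); d_s and d_t may differ.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_s = s @ s.T                      # (N, N) student similarity structure
    sim_t = t @ t.T                      # (N, N) teacher similarity structure
    return F.mse_loss(sim_s, sim_t.detach())
```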
Classic Hinton distillation was designed for classification — a fixed-size logit vector. LLMs generate sequences — the output grows token by token, and there's no natural "final output" to match. This creates new challenges, new loss formulations, and new techniques specific to the LLM regime.
Token-level KD (the straightforward extension): for each position in the sequence, compute KL divergence between teacher and student token distributions. This is efficient and directly analogous to Hinton's original method. It's what most open-source distillation implementations do.
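A sketch of token-level KD with padding masked out (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def token_level_kd(student_logits, teacher_logits, attention_mask, T=2.0):
    """Per-position KL(teacher || student) over the vocabulary.

    Logits: (batch, seq, vocab); attention_mask: (batch, seq), 1 = real token.
    """
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # KL per position: sum over the vocabulary of p_t * (log p_t - log p_s).
    kl = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(-1)   # (batch, seq)
    kl = kl * attention_mask                                     # drop padding
    return (T ** 2) * kl.sum() / attention_mask.sum()
```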
Sequence-level KD (SeqKD, Kim & Rush 2016): instead of matching token distributions at each step, first generate complete sequences from the teacher, then train the student on those generated sequences using standard cross-entropy. Simpler — no need to run teacher and student simultaneously — but loses the soft distribution signal, only transferring the teacher's greedy decodes.
There's a subtle but important asymmetry. Forward KL (teacher ‖ student) is mass-covering: it forces the student to spread probability over every mode the teacher covers, even modes the student can't model well. Reverse KL (student ‖ teacher) is mode-seeking: the student concentrates its mass on the most probable modes and drops the rest.
MiniLLM (Gu et al., 2023) showed that minimizing reverse KL for text generation produces better quality than forward KL, because reverse KL avoids the problem of the student assigning high probability to long-tail sequences the teacher would never produce. The student learns to be sharp and precise rather than trying (and failing) to be comprehensive.
Forward vs Reverse KL Intuition: Imagine the teacher distribution has two modes, "The cat sat" and "A feline rested". Forward KL forces the student to cover both modes, splitting probability mass and producing blurry, averaged outputs. Reverse KL lets the student pick one mode and be very confident about it. For text generation quality, concentrated confidence beats diluted coverage. MiniLLM reported win rates 2–10 points higher than standard forward-KL distillation across benchmarks.
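The two divergences side by side, as pointwise sketches. MiniLLM itself optimizes reverse KL at the sequence level with policy-gradient methods; this only shows the per-distribution intuition:

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    """KL(p || q): mass-covering. Penalizes the student wherever the teacher
    has probability mass, even on modes the student cannot fit."""
    p = F.softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (p.clamp_min(1e-9).log() - log_q)).sum(-1).mean()

def reverse_kl(teacher_logits, student_logits):
    """KL(q || p): mode-seeking. The student may drop low-probability modes,
    but is heavily penalized for putting mass where the teacher has none."""
    q = F.softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (q.clamp_min(1e-9).log() - log_p)).sum(-1).mean()
```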
The most powerful capability of large models isn't their one-shot answers — it's their ability to reason through problems step by step. Chain-of-thought (CoT) distillation transfers this reasoning ability to smaller models by including the teacher's reasoning chains in the training data.
The pipeline: run the teacher (GPT-4, Claude Opus) on reasoning tasks with CoT prompting. Collect the full reasoning traces — not just final answers. Fine-tune the student on (problem, reasoning trace + answer) pairs. The student learns not just what the answer is, but how to get there.
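A sketch of the data-construction step; `query_teacher` is a hypothetical stand-in for whatever API client you use, and the prompt wording is illustrative:

```python
def build_cot_example(problem: str, query_teacher) -> dict:
    """Turn one problem into a (prompt, completion) fine-tuning example."""
    prompt = f"Solve step by step, then state the final answer.\n\nProblem: {problem}"
    completion = query_teacher(prompt)   # full reasoning trace + final answer
    return {
        "prompt": problem,
        # Train on the entire trace, not just the final answer: the
        # reasoning path is the signal being distilled.
        "completion": completion,
    }
```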
DeepSeek-R1 (2025) is the most dramatic demonstration of CoT distillation to date. The full R1 model uses GRPO reinforcement learning to develop long chain-of-thought reasoning. Then, 800K high-quality CoT examples from R1 are used to fine-tune smaller open-source models (Llama-3.1-8B, Qwen-2.5-7B, etc.). The resulting distilled models match or beat GPT-4o on math and coding benchmarks at 8B parameters — a compression ratio that would have seemed impossible two years ago. The key: the teacher's reasoning chains are so rich that the student doesn't need to rediscover how to reason from scratch.
| Approach | Teacher Access Needed | Signal Quality | Examples |
|---|---|---|---|
| SeqKD / Data Augmentation | Outputs (API access) | Low — hard labels only | Alpaca, WizardLM, Orca |
| Token-Level KD | Output logits | Medium — soft labels | DistilBERT, TinyLLaMA |
| Feature KD | Internal activations | High — representations | TinyBERT, BERT-PKD |
| CoT Distillation | Outputs (API access) | Very High — reasoning traces | DeepSeek-R1-Distill, Phi-3 |
| Online KD | Both models active simultaneously | Highest — dynamic alignment | Research setting |
In the combined loss L = α · L_soft + (1 − α) · L_hard: α = 0 → standard supervised training (no distillation). α = 1 → pure teacher imitation (no ground truth). α = 0.5–0.9 → typical distillation range.
DistiLLM (Ko et al., 2024) unifies multiple loss functions under one framework and proposes a novel "skew" KL divergence that sits between forward and reverse KL, combining their strengths:
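A sketch of the skewed forward direction, roughly of the form the paper describes: a little of the teacher's own distribution is mixed into the comparison target, which bounds the divergence even where the student assigns near-zero mass (the parameter name `lam` and its value are illustrative):

```python
import torch.nn.functional as F

def skew_forward_kl(teacher_logits, student_logits, lam=0.1):
    """Skewed forward KL: KL(p || lam * p + (1 - lam) * q).

    Because the target mixture always contains lam * p, the log-ratio is
    bounded by log(1 / lam), avoiding forward KL's blow-up on zero-mass
    student regions while keeping its mass-covering character.
    """
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = lam * p + (1 - lam) * q
    return (p * (p.clamp_min(1e-9).log() - mix.clamp_min(1e-9).log())).sum(-1).mean()
```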
This connection is underappreciated. In speculative decoding, a small draft model proposes tokens, and a large verification model accepts or rejects them. If you train the draft model to maximize the verifier's acceptance rate, you're essentially doing reverse-KL distillation: the draft learns to match the distribution the verifier would accept.
Models like Medusa and Eagle make this explicit: they train draft heads using the large model's hidden states as supervision. The boundary between "distillation" and "speculative decoding training" is almost nonexistent once you look at the loss functions.
An emerging technique: have a model distill from itself. Take a model's high-temperature samples (exploratory, diverse outputs) and use them as training signal for the same model at low temperature (more deterministic). This iterative self-improvement tightens the model's distribution without any external teacher. Used in LLM constitutional AI pipelines and increasingly in RLHF alternatives.
[Chart: % of teacher benchmark score retained (Y-axis) vs. compression ratio, teacher size / student size (X-axis); bubble size = student parameter count.]
| Model | Teacher | Student Size | Compression | Distillation Type | Key Result |
|---|---|---|---|---|---|
| DistilBERT | BERT-Base (110M) | 66M | 1.67× | Response + Feature | 97% of BERT on GLUE, 60% faster |
| TinyBERT | BERT-Base (110M) | 14.5M | 7.5× | Attention + Feature | 96.8% of BERT-Base, 9.4× faster |
| MiniLM | BERT-Large | 22M | 16× | Attention relation | 99%+ on MNLI, SQuAD |
| TinyLLaMA | Llama-2 (7B) | 1.1B | 6.4× | SeqKD + token-KD | Competitive on commonsense benchmarks |
| Phi-1 | GPT-3.5 (API) | 1.3B | ~100× | CoT / Data Distillation | 50.6% HumanEval — matches Codex |
| Phi-2 | Multiple (API) | 2.7B | ~25× | Curated synthetic data | Beats Llama-2-13B on most benchmarks |
| DeepSeek-R1-Distill-8B | DeepSeek-R1 (671B) | 8B | 84× | CoT reasoning traces | Matches o1-mini on MATH-500 |
| DeepSeek-R1-Distill-70B | DeepSeek-R1 (671B) | 70B | 9.6× | CoT reasoning traces | 85.0% MATH-500, beats GPT-4o |
DeepSeek-R1 deserves special attention because it represents a qualitative jump in what distillation can achieve. The recipe: train the full 671B model with GRPO reinforcement learning until it develops long chain-of-thought reasoning, curate roughly 800K high-quality CoT traces from it, then transfer that reasoning into much smaller open models with plain supervised fine-tuning.
Why these numbers are extraordinary: DeepSeek-R1-Distill-8B achieves 72.6% on AIME 2024 and 89.5% on MATH-500. OpenAI's o1-mini achieves 70% on AIME and 90% on MATH-500. A 671B teacher just transferred near-frontier reasoning ability into an 8B model — an 84× compression — via pure supervised fine-tuning on CoT data, no RL required. This is the most significant distillation result in the history of the field.
The key insight: reasoning ability is highly transferable via examples once you have a teacher that can demonstrate it. The student doesn't need to rediscover how to do chain-of-thought — it learns the pattern from examples the same way humans learn proof techniques from worked examples. The bottleneck was never the student's capacity; it was never having a teacher good enough to demonstrate reasoning in the first place.
Phi-1 (1.3B) showed that 7B tokens of textbook-quality synthetic data beat 1T tokens of web scrapes for coding. Distillation that curates 100K excellent teacher examples consistently beats distillation on 10M mediocre ones. The teacher's quality is the ceiling; curation determines how close you get to it.
Giving the student a finished answer is far less valuable than giving it the reasoning path. Models trained on reasoning traces show strong generalization to novel problems in the same domain — they've learned a process, not just an answer bank. This doesn't work with black-box label distillation.
Empirically, a well-distilled student at N parameters tends to match a non-distilled model at 4N parameters. This varies widely by task and distillation quality, but is a useful starting point when deciding whether distillation is worth the engineering investment vs just training a larger student from scratch.
Distillation works best when teacher and student see the same or very similar training distribution. Using GPT-4 as a math teacher for a student trained on general web text transfers poorly — the student lacks the base representations to absorb the teacher's reasoning signals. Domain alignment first, distillation second.
Too large a teacher-student gap hurts. A student below roughly 0.1× of the teacher's capacity struggles to absorb the full distribution; the KL divergence loss becomes dominated by probability mass the student cannot express. A 0.1×–0.3× capacity ratio is the empirical sweet spot for feature distillation; CoT distillation tolerates larger gaps.
Distillation → quantization is a natural pipeline. Distillation gives you a small model. Quantization makes it fast. Distillation after quantization (QAT-KD) — training the quantized student to recover quality from the full-precision teacher — often recovers most of the accuracy lost from int8/int4 quantization.
Train on API-generated outputs (SeqKD). Get GPT-4 or Claude to generate high-quality responses for your task. Fine-tune your student on those. Not distillation in the strict sense — but often the most practical option.
CoT distillation is the current answer. Use a frontier model (R1, Claude 3.7, Gemini 2.5 Pro) with extended thinking to generate reasoning traces on your task domain. Fine-tune on (problem, trace, answer) triplets. Expect +10–30% on reasoning benchmarks vs standard fine-tuning.
Use token-level KD — log-probability matching at each position. Combine with feature distillation if you can afford the memory. Run teacher and student on the same GPU with mixed precision and gradient checkpointing.
Distillation + quantization pipeline. Distill to 1–3B first. Then QLoRA + QAT-KD to recover accuracy after int4 quantization. Export to GGUF Q4_K_M for llama.cpp. Target 500MB–1GB total model size.
My take: Distillation has moved from a compression technique to a capability transfer technique. The DeepSeek-R1 result changes the calculus entirely — you don't need to train a reasoning model from scratch with expensive RL. You need a frontier teacher that can reason, and enough high-quality examples of that reasoning to fine-tune a small model that mimics it. The next five years will be about systematically mapping what capabilities transfer efficiently and what requires the student to have genuine independent capacity. My bet: factual recall transfers poorly (the student needs its own parametric memory). Reasoning patterns transfer extremely well (the student learns algorithms, not facts).