Deep Dive · Model Compression

Knowledge Distillation in LLMs

How do you fit a trillion-parameter mind into a phone? The answer involves temperature, soft labels, and one of the most elegant ideas in modern machine learning — teaching a small model to think like a large one.

March 2026 · Prateek Singh, PhD
Distillation · Model Compression · LLM · On-Device AI

A large model contains more knowledge than its outputs reveal

When GPT-4 predicts the next token, it doesn't just output the single most probable word. It outputs a probability distribution over the entire vocabulary — 100,000+ numbers, each one reflecting some learned relationship about language. A small model trained only on labels ("the correct answer is cat") misses all of that richness. Distillation is how you transfer it.

Geoffrey Hinton coined the term in 2015. The insight was deceptively simple: train the student on the teacher's soft probability outputs, not just the hard correct labels. Those soft probabilities encode the teacher's internal beliefs about similarity, ambiguity, and structure — far richer than a binary right/wrong signal.

🌡️

Temperature Scaling

Softening the teacher's output distribution with a temperature parameter T reveals probability mass placed on "wrong but similar" answers — the hidden knowledge.

🎓

Response Distillation

Student learns from the teacher's final output distribution — its logits, softened into probabilities. The most common form. Works black-box — you only need teacher inference, not internals.

🧠

Feature Distillation

Student mimics the teacher's intermediate hidden representations, not just the output. Richer signal but requires white-box access to the teacher's architecture.

🔗

Relation Distillation

Student learns relationships between examples — the structure of the teacher's embedding space. Useful for metric learning and cross-modal transfer.

01
Hinton's Original Insight
Distilling the Knowledge in a Neural Network, 2015
Hinton, Vinyals, Dean · arXiv 2015 (NIPS Deep Learning Workshop) · Foundational paper

Before Hinton, model compression was largely about pruning, quantization, and architectural redesign — mechanical shrinkage. The distillation paper proposed something different: the large model itself is the curriculum. Instead of discarding the teacher's knowledge, you use it as a richer training signal for the student.

The key observation: when a well-trained model predicts a cat image, it doesn't just say "cat: 99%". It says something like "cat: 99%, leopard: 0.7%, dog: 0.2%, tiger: 0.05%". Those tiny probabilities on "wrong" labels encode the model's understanding that cats are more similar to leopards than to cars. That structure is gold — and a hard label "cat: 1, everything else: 0" throws it all away.

💡

The "dark knowledge" concept: Hinton called the information encoded in near-zero probabilities dark knowledge. A model trained on MNIST might assign 10⁻⁶ probability to "2" when the input is clearly a "3" — but that tiny value still encodes the fact that 2 and 3 are more similar than 2 and 7. Over millions of examples, this structure teaches the student far more than binary labels ever could.

Distillation Objective — Combined Loss
Hard loss — student vs ground truth labels:
Lhard = CrossEntropy( student_logits, true_labels )

Soft loss — student vs teacher soft targets (at temperature T):
Lsoft = T² · KL( σ(z_t/T) ‖ σ(z_s/T) )

Combined objective (α controls the balance):
L = (1−α) · Lhard + α · Lsoft

where z_s = student logits
       z_t = teacher logits (frozen, no gradient)
       σ(z/T) = softmax with temperature T
       T² factor restores gradient magnitude lost by softening

The T² factor is subtle but important. When you divide logits by T before softmax, the gradient of the soft loss is scaled down by 1/T². Multiplying by T² cancels this out — ensuring the soft and hard losses contribute at comparable magnitudes regardless of temperature. Hinton specifically calls this out and it's often missed in implementations.
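As a concrete sketch, here is the combined objective in NumPy (forward-only, no autograd; the name `distillation_loss` and the toy logits are illustrative, and in a real trainer the teacher's logits would be computed without gradient tracking):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax, numerically stabilized.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(z_s, z_t, y_true, T=4.0, alpha=0.7):
    # Combined objective: L = (1 - alpha) * L_hard + alpha * L_soft.
    p_s = softmax(z_s, T)
    p_t = softmax(z_t, T)  # teacher logits are frozen in a real trainer
    # Soft loss: forward KL(teacher || student) at temperature T, scaled
    # by T^2 to restore the gradient magnitude lost by softening.
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    soft = (T ** 2) * kl.mean()
    # Hard loss: cross-entropy of the student (at T = 1) vs ground truth.
    log_p = np.log(softmax(z_s, 1.0) + 1e-12)
    hard = -log_p[np.arange(len(y_true)), y_true].mean()
    return (1 - alpha) * hard + alpha * soft

rng = np.random.default_rng(0)
z_t = rng.normal(size=(2, 5)) * 3   # stand-in teacher logits: 2 examples, 5 classes
z_s = rng.normal(size=(2, 5))       # stand-in student logits
loss = distillation_loss(z_s, z_t, y_true=np.array([1, 3]))
```

When student and teacher logits agree exactly, the soft term vanishes; in practice α and T are tuned per task (the article's guidance: α ≈ 0.5–0.9, T ≈ 3–8).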

Standard Distillation Setup
TEACHER (frozen)
Large Model (e.g. a 70B teacher)
Logits z_t (÷ T)
Soft probabilities p_t
STUDENT (training)
Small Model (e.g. 1.7B)
Logits z_s (÷ T)
Soft probabilities p_s
KL(p_t ‖ p_s) · T²
+
CrossEntropy(z_s, y_true)
=
L_distill → backprop into student only

Why KL Divergence, Not MSE?

You could minimize MSE between teacher and student logits directly. But KL divergence treats the teacher's distribution as a probability distribution and measures how much information the student loses. It asymmetrically penalizes the student for assigning low probability where the teacher assigned high probability — which is exactly what you want. MSE treats all errors equally regardless of probability mass, making it a worse fit for this task.

02
The Temperature Trick
How to make a model reveal what it almost said

The softmax function converts raw logits into probabilities. Normally you just compute exp(z_i) / Σ exp(z_j). Temperature scaling adds one parameter T that controls how "peaked" or "flat" the resulting distribution is:

Temperature-Scaled Softmax
p_i(T) = exp(z_i / T) / Σ_j exp(z_j / T)

T = 1.0 → standard softmax (peaked distribution)
T → ∞ → uniform distribution (maximum entropy)
T → 0 → one-hot argmax (hard labels)
T = 4–8 → typical distillation range (soft but informative)
[Interactive chart — Temperature Effect on Soft Labels: the teacher's logit distribution and its softmax output at an adjustable temperature T, sweeping from T = 0.2 (near hard) to T = 10 (near uniform).]

Notice how at high T, the model reveals its beliefs about "cat vs leopard" similarity. At T=1, that information is almost entirely hidden in near-zero probability mass.

In practice, the teacher and student both use the same temperature T during distillation training. At inference time, you set T = 1.0 for the deployed student model — the temperature was only a training tool to enrich the supervision signal.
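A quick NumPy illustration of the effect, using made-up teacher logits over a hypothetical [cat, leopard, dog, tiger, car] vocabulary:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over a 1-D logit vector.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; rises as the distribution flattens.
    return float(-np.sum(p * np.log(p)))

# Made-up teacher logits over [cat, leopard, dog, tiger, car].
logits = [9.0, 4.0, 2.0, 3.0, -2.0]

p1 = softmax(logits, T=1.0)   # peaked: ~99% of the mass on "cat"
p5 = softmax(logits, T=5.0)   # softened: cat/leopard/tiger structure now visible
```

At T = 5 the ranking is preserved (leopard still outranks dog) but the gaps shrink and entropy rises, which is exactly the "dark knowledge" the student trains on.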

Choosing Temperature in Practice

Too low (T < 1): barely different from hard labels. The soft distribution is still very peaked around the argmax. Too high (T > 20): the distribution becomes so uniform it's nearly random noise — no useful signal. T = 3–5 is the standard starting point for most LLM distillation tasks. For tasks with high semantic overlap between classes (like token-level language modeling where many tokens are nearly equivalent), higher T (6–10) often works better.

03
Three Flavors of Knowledge
Response, Feature, and Relation — what exactly gets transferred

The student is trained to match the teacher's output layer predictions — logits or probabilities. This is the original Hinton formulation and remains the most common form, especially for LLMs where "output" means the full vocabulary distribution at each position.

Response-Based — Output Matching
TEACHER
Layer N (output)
Layer 2
Layer 1
Input tokens
KL Divergence
⟵⟶
output only
STUDENT
Layer M (output)
Layer 2
Layer 1
Input tokens

Instead of just matching the final output, the student is trained to match the teacher's intermediate hidden representations. Each transformer layer produces a hidden state tensor. Feature distillation adds auxiliary losses that penalize differences between corresponding hidden states in teacher and student.

This requires a white-box teacher — you need access to internal activations. For GPT-4 or Claude, this is impossible. For open models like Llama or Mistral, it's entirely feasible. The tradeoff: richer signal at the cost of requiring teacher internals and often needing projection layers (since teacher and student hidden dimensions differ).

Feature Distillation Loss — FitNets / PKD Style
Lfeat = Σ_{l ∈ mapped_layers} ‖ h_s^l − W · h_t^l ‖²

where h_s^l = student hidden state at layer l (dim d_s)
       h_t^l = teacher hidden state at the mapped layer (dim d_t)
       W = learned linear projection (d_t → d_s)

Common layer mapping (24-layer teacher → 6-layer student, every 4th layer):
t_layer 4 ↔ s_layer 1
t_layer 8 ↔ s_layer 2
t_layer 12 ↔ s_layer 3 … t_layer 24 ↔ s_layer 6
(DistilBERT applies the same uniform-skip idea to a 12-layer teacher)
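A forward-only NumPy sketch of this loss for one mapped layer pair; the dimensions and the random `W` initialization are invented, and in training `W` would be learned jointly with the student:

```python
import numpy as np

rng = np.random.default_rng(0)
d_t, d_s, seq_len = 1024, 512, 16       # hypothetical hidden sizes, sequence length

h_t = rng.normal(size=(seq_len, d_t))   # teacher hidden states at a mapped layer
h_s = rng.normal(size=(seq_len, d_s))   # student hidden states at the paired layer
W = rng.normal(size=(d_t, d_s)) * 0.01  # projection d_t -> d_s, learned in practice

def feature_loss(h_s, h_t, W):
    # MSE between student states and projected teacher states.
    diff = h_s - h_t @ W
    return float(np.mean(diff ** 2))

loss = feature_loss(h_s, h_t, W)
```

The projection exists only because teacher and student widths differ; if the student's states already equal the projected teacher states, the loss is zero.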

Attention Map Distillation

A variant of feature distillation: instead of matching hidden state activations, match the attention weight matrices across layers. The teacher's attention patterns encode which tokens it decided mattered for each prediction. Transferring these patterns teaches the student where to look, not just what to output. Used in TinyBERT and PKD-BERT — often more effective than hidden state matching for downstream task performance.

Instead of matching individual outputs or activations, relation-based distillation matches relationships between examples. For a batch of N examples, you compute an N×N similarity matrix from both teacher and student embeddings and train the student to reproduce the teacher's similarity structure.

This is especially powerful for embedding models and retrieval tasks — it teaches the student to place similar concepts near each other in representation space, even if the absolute coordinates differ. Used extensively in contrastive learning and cross-modal distillation (e.g. distilling CLIP into smaller vision encoders).
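A sketch of the idea, assuming cosine similarity as the relational measure (distances and angles are other choices that appear in the literature):

```python
import numpy as np

def similarity_matrix(E):
    # Cosine similarity between every pair of embeddings in the batch.
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

def relation_loss(E_s, E_t):
    # Match the student's N x N similarity structure to the teacher's.
    return float(np.mean((similarity_matrix(E_s) - similarity_matrix(E_t)) ** 2))

rng = np.random.default_rng(1)
E_t = rng.normal(size=(8, 256))   # teacher embeddings for a batch of 8
E_s = rng.normal(size=(8, 64))    # student embeddings; a different dim is fine,
                                  # since only the 8 x 8 relation matrices compare
loss = relation_loss(E_s, E_t)
```

Because cosine similarity ignores scale, a student whose embeddings are any rescaling of the teacher's incurs zero loss: the absolute coordinates genuinely don't matter, only the relational structure.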

04
Distillation at Scale
New problems that emerge when the teacher has 100B+ parameters
Black-box teachers · Autoregressive KD · Chain-of-thought

Classic Hinton distillation was designed for classification — a fixed-size logit vector. LLMs generate sequences — the output grows token by token, and there's no natural "final output" to match. This creates new challenges, new loss formulations, and new techniques specific to the LLM regime.

Token-level KD (the straightforward extension): for each position in the sequence, compute KL divergence between teacher and student token distributions. This is efficient and directly analogous to Hinton's original method. It's what most open-source distillation implementations do.

Sequence-level KD (SeqKD, Kim & Rush 2016): instead of matching token distributions at each step, first generate complete sequences from the teacher, then train the student on those sequences with standard cross-entropy. Simpler — no need to run teacher and student simultaneously — but it discards the soft distribution signal, transferring only the teacher's decoded outputs.

Token-Level vs Sequence-Level KD
Token-level KD (forward KL at each position t):
LTKD = Σ_{t=1..T} KL( p_teacher(·|x, y_{<t}) ‖ p_student(·|x, y_{<t}) )

Sequence-level KD (train on teacher outputs directly):
LSKD = −Σ_{t=1..T} log p_student( ŷ_t | x, ŷ_{<t} )
where ŷ = a sequence sampled from the teacher

There's a subtle but important asymmetry. Forward KL (teacher ‖ student) forces the student to cover every mode the teacher places probability on — even modes the student lacks the capacity to model well. Reverse KL (student ‖ teacher) is mode-seeking: the student concentrates its probability mass on the teacher's dominant modes and ignores the rest.

MiniLLM (Gu et al., 2023) showed that minimizing reverse KL for text generation produces better quality than forward KL, because reverse KL avoids the problem of the student assigning high probability to long-tail sequences the teacher would never produce. The student learns to be sharp and precise rather than trying (and failing) to be comprehensive.

📐

Forward vs Reverse KL Intuition: Imagine the teacher distribution has two modes — "The cat sat" and "A feline rested". Forward KL forces the student to cover both modes, splitting probability mass and producing blurry, averaged outputs. Reverse KL lets the student pick one mode and be very confident about it. For text generation quality, concentrated confidence beats diluted coverage. MiniLLM showed 2–10 point win rates over standard forward-KL distillation across benchmarks.

The most powerful capability of large models isn't their one-shot answers — it's their ability to reason through problems step by step. Chain-of-thought (CoT) distillation transfers this reasoning ability to smaller models by including the teacher's reasoning chains in the training data.

The pipeline: run the teacher (GPT-4, Claude Opus) on reasoning tasks with CoT prompting. Collect the full reasoning traces — not just final answers. Fine-tune the student on (problem, reasoning trace + answer) pairs. The student learns not just what the answer is, but how to get there.
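A sketch of what one training record and its flattened SFT target might look like; the field names mirror the (problem, reasoning trace, answer) triplet above, but the exact schema and the think-tag format are illustrative assumptions, not any lab's published format:

```python
import json

# Hypothetical shape of one CoT-distillation record.
record = {
    "problem": "If 3x + 5 = 20, what is x?",
    "thinking_process": "Subtract 5 from both sides: 3x = 15. Divide by 3: x = 5.",
    "answer": "5",
}

# Flatten into a single SFT target string: reasoning first, then the answer,
# so the student learns to emit its chain of thought before committing.
target = "<think>" + record["thinking_process"] + "</think>\n" + record["answer"]
line = json.dumps(record)  # one line of the JSONL fine-tuning set
```

Filtering (dropping wrong answers and degenerate traces) happens on these records before fine-tuning, as in the R1 pipeline described below.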

DeepSeek-R1 — The Landmark CoT Distillation

DeepSeek-R1 (2025) is the most dramatic demonstration of CoT distillation to date. The full R1 model uses GRPO reinforcement learning to develop long chain-of-thought reasoning. Then, 800K high-quality CoT examples from R1 are used to fine-tune smaller open-source models (Llama-3.1-8B, Qwen-2.5-7B, etc.). The resulting distilled models match or beat GPT-4o on math and coding benchmarks at 8B parameters — a compression ratio that would have seemed impossible two years ago. The key: the teacher's reasoning chains are so rich that the student doesn't need to rediscover how to reason from scratch.

Approach | Teacher Access Needed | Signal Quality | Examples
SeqKD / Data Augmentation | Outputs (API access) | Low — hard labels only | Alpaca, WizardLM, Orca
Token-Level KD | Output logits | Medium — soft labels | DistilBERT, TinyLLaMA
Feature KD | Internal activations | High — representations | TinyBERT, PKD-BERT
CoT Distillation | Outputs (API access) | Very high — reasoning traces | DeepSeek-R1-Distill, Phi-3
Online KD | Both models active simultaneously | Highest — dynamic alignment | Research setting
05
The Loss Cookbook
Every distillation method boils down to what you're minimizing
Hard vs Soft Loss Balance: at α = 0.7, the combined loss is 70% soft (α·L_KD) and 30% hard ((1−α)·CE).

α = 0 → standard supervised training (no distillation). α = 1 → pure teacher imitation (no ground truth). α = 0.5–0.9 → typical distillation range.

DistiLLM (Ko et al., 2024) unifies multiple distillation losses under one framework and proposes skewed variants of forward and reverse KL that temper their respective failure modes:

DistiLLM — Skew KL Divergence
LskewKL(β) = KL( p_t ‖ β·p_t + (1−β)·p_s )

β = 0 → plain forward KL (full mode-covering pressure)
β = 0.1 → DistiLLM default (best empirical performance)
(the skew reverse KL swaps the roles: KL( p_s ‖ β·p_s + (1−β)·p_t ))

Mixing a small fraction of the teacher into the second argument keeps
the ratio p_t / (β·p_t + (1−β)·p_s) bounded by 1/β wherever the student
assigns near-zero mass, so the objective behaves like a milder forward
KL with reduced mode-covering pressure and avoids blurry outputs.
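A sketch of the skew divergence, assuming the DistiLLM-style definition KL(p ‖ β·p + (1−β)·q); the toy distributions are invented:

```python
import numpy as np

def kl(p, q):
    # KL divergence between two discrete distributions (nats).
    return float(np.sum(p * np.log(p / q)))

def skew_kl(p, q, beta=0.1):
    # Skew divergence: KL(p || beta*p + (1 - beta)*q). Mixing a little of p
    # into the second argument bounds the density ratio by 1/beta, keeping
    # the loss finite even where q places near-zero mass.
    return kl(p, beta * p + (1 - beta) * q)

p_t = np.array([0.70, 0.25, 0.05])   # toy teacher distribution
p_s = np.array([0.10, 0.10, 0.80])   # toy, badly matched student

plain = kl(p_t, p_s)                 # plain forward KL
skew = skew_kl(p_t, p_s, beta=0.1)   # tempered variant
```

Setting β = 0 recovers plain forward KL exactly; any β > 0 shrinks the divergence on badly matched pairs, which is the reduced mode-covering pressure described above.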

This connection is underappreciated. In speculative decoding, a small draft model proposes tokens and a large verification model accepts or rejects them. If you train the draft model to maximize the verifier's acceptance rate, you're essentially doing reverse-KL distillation — the draft learns to concentrate on outputs the verifier would accept.

Models like Medusa and Eagle make this explicit: they train draft heads using the large model's hidden states as supervision. The boundary between "distillation" and "speculative decoding training" is almost nonexistent once you look at the loss functions.

The Self-Distillation Loop

An emerging technique: have a model distill from itself. Take a model's high-temperature samples (exploratory, diverse outputs) and use them as training signal for the same model at low temperature (more deterministic). This iterative self-improvement tightens the model's distribution without any external teacher. Used in LLM constitutional AI pipelines and increasingly in RLHF alternatives.

From DistilBERT to DeepSeek-R1

[Bubble chart — Compression Ratio vs Performance Retention: x-axis is compression ratio (teacher size / student size), y-axis is % of teacher benchmark score retained, bubble size is student parameter count; series: encoder (BERT-family), decoder/LLM, CoT-distilled.]

Model | Teacher | Student Size | Compression | Distillation Type | Key Result
DistilBERT | BERT-Base (110M) | 66M | 1.67× | Response + Feature | 97% of BERT on GLUE, 60% faster
TinyBERT | BERT-Base (110M) | 14.5M | 7.5× | Attention + Feature | 96.8% of BERT-Base, 9.4× faster
MiniLM | BERT-Large | 22M | 16× | Attention relation | 99%+ on MNLI, SQuAD
TinyLLaMA | Llama-2 (7B) | 1.1B | 6.4× | SeqKD + token-KD | Competitive on commonsense benchmarks
Phi-1 | GPT-3.5 (API) | 1.3B | ~100× | CoT / data distillation | 50.6% HumanEval — matches Codex
Phi-2 | Multiple (API) | 2.7B | ~25× | Curated synthetic data | Beats Llama-2-13B on most benchmarks
DeepSeek-R1-Distill-8B | DeepSeek-R1 (671B) | 8B | 84× | CoT reasoning traces | Matches o1-mini on MATH-500
DeepSeek-R1-Distill-70B | DeepSeek-R1 (671B) | 70B | 9.6× | CoT reasoning traces | 94.5% MATH-500, beats GPT-4o

DeepSeek-R1 deserves special attention because it represents a qualitative jump in what distillation can achieve. The recipe:

DeepSeek-R1 Distillation Pipeline
DeepSeek-R1 (671B MoE) — GRPO-trained reasoning model
Generate 800K CoT examples across math, code, science, logic
Filter: remove incorrect answers, too-short chains, format violations
~600K high-quality (problem, thinking_process, answer) triplets
Supervised fine-tuning (SFT) of Qwen-2.5-7B / Llama-3.1-8B / Llama-3.3-70B
R1-Distill-7B
R1-Distill-8B
R1-Distill-14B
R1-Distill-70B
🔬

Why these numbers are extraordinary: DeepSeek-R1-Distill-Llama-8B achieves 50.4% on AIME 2024 and 89.1% on MATH-500, essentially matching OpenAI's o1-mini on the latter (63.6% AIME, 90.0% MATH-500) — and the larger 14B and 32B distills surpass o1-mini on AIME outright. A 671B teacher just transferred near-frontier reasoning ability into an 8B model — an 84× compression — via pure supervised fine-tuning on CoT data, no RL required. It is arguably the most significant distillation result in the field's history.

The key insight: reasoning ability is highly transferable via examples once you have a teacher that can demonstrate it. The student doesn't need to rediscover how to do chain-of-thought — it learns the pattern from examples the same way humans learn proof techniques from worked examples. The bottleneck was never the student's capacity; it was never having a teacher good enough to demonstrate reasoning in the first place.

How Distillation Evolved

2006
Model Compression — Buciluă et al.
First paper (KDD 2006) on training a small model to mimic the outputs of a large ensemble. The precursor idea — but without temperature scaling or the modern formulation.
Ensemble Compression
Mar 2015
Distilling the Knowledge in a Neural Network
Hinton, Vinyals, Dean formalize distillation with temperature scaling, the T² correction, and the combined hard/soft loss. The foundational paper. Still the starting point for every practitioner.
KD · Temperature · Dark Knowledge
2015
FitNets — Feature-Based Distillation
Romero et al. (ICLR 2015) show students can be trained to mimic intermediate hidden representations ("hints"), enabling deeper, thinner students to outperform shallower, wider ones.
FitNets · Feature KD
Oct 2019
DistilBERT & TinyBERT
First successful application to BERT-scale transformers. DistilBERT uses response + feature KD. TinyBERT adds attention matrix distillation. Both become production standards.
DistilBERT · TinyBERT
Jun 2023
MiniLLM — Reverse KL for Generation
Shows that minimizing reverse KL (student ‖ teacher) outperforms forward KL for autoregressive generation. Reframes the loss function design space for the LLM era.
MiniLLM · Reverse KL
Dec 2023
Phi-2 — Textbook-Quality Data Wins
Microsoft shows 2.7B model matches 7B+ models via curated synthetic training data from GPT-4. Redefines what "distillation" means — it's not just about loss functions but data curation.
Phi-2 · Synthetic Data
Apr 2024
DistiLLM — Skew KL Unification
Proposes skew KL divergence as a unified objective that interpolates between forward and reverse KL. Achieves consistent wins across model sizes and tasks.
DistiLLM · Skew KL
Jan 2025
DeepSeek-R1 Distillation
800K reasoning traces from R1 fine-tune Llama and Qwen base models. R1-Distill-8B matches o1-mini. Proves CoT distillation can transfer frontier reasoning at 84× compression.
DeepSeek-R1 · CoT KD · Reasoning
2025+
The Reasoning Distillation Era
Every frontier lab now has distilled reasoning variants. The question shifts from "can we distill reasoning?" to "how much reasoning can we pack into 1B parameters?"
Qwen-Distill · Gemma-3-4B · Phi-4-mini

A Practical Decision Tree

🔑

No teacher access?

Train on API-generated outputs (SeqKD). Get GPT-4 or Claude to generate high-quality responses for your task. Fine-tune your student on those. Not distillation in the strict sense — but often the most practical option.

🧮

Need reasoning?

CoT distillation is the current answer. Use a frontier model (R1, Claude 3.7, Gemini 2.5 Pro) with extended thinking to generate reasoning traces on your task domain. Fine-tune on (problem, trace, answer) triplets. Expect +10–30% on reasoning benchmarks vs standard fine-tuning.

Open-weight teacher?

Use token-level KD — log-probability matching at each position. Combine with feature distillation if you can afford the memory. Run teacher and student on the same GPU with mixed precision and gradient checkpointing.

📱

Targeting edge/mobile?

Distillation + quantization pipeline. Distill to 1–3B first. Then QLoRA + QAT-KD to recover accuracy after int4 quantization. Export to GGUF Q4_K_M for llama.cpp. Target 500MB–1GB total model size.

🔬

My take: Distillation has moved from a compression technique to a capability transfer technique. The DeepSeek-R1 result changes the calculus entirely — you don't need to train a reasoning model from scratch with expensive RL. You need a frontier teacher that can reason, and enough high-quality examples of that reasoning to fine-tune a small model that mimics it. The next five years will be about systematically mapping what capabilities transfer efficiently and what requires the student to have genuine independent capacity. My bet: factual recall transfers poorly (the student needs its own parametric memory). Reasoning patterns transfer extremely well (the student learns algorithms, not facts).