How do you fit a trillion-parameter mind into a phone? The answer involves temperature, soft labels, and one of the most elegant ideas in modern machine learning — teaching a small model to think like a large one.
When GPT-4 predicts the next token, it doesn't just output the single most probable word. It outputs a probability distribution over the entire vocabulary — 100,000+ numbers, each one reflecting some learned relationship about language. A small model trained only on labels ("the correct answer is cat") misses all of that richness. Distillation is how you transfer it.
Geoffrey Hinton coined the term in 2015. The insight was deceptively simple: train the student on the teacher's soft probability outputs, not just the hard correct labels. Those soft probabilities encode the teacher's internal beliefs about similarity, ambiguity, and structure — far richer than a binary right/wrong signal.
Softening the teacher's output distribution with a temperature parameter T reveals probability mass placed on "wrong but similar" answers — the hidden knowledge.
Response-based: student learns from the teacher's final output logits (or the probabilities derived from them). The most common form. Works black-box: you only need teacher inference, not internals.
Feature-based: student mimics the teacher's intermediate hidden representations, not just the output. Richer signal, but requires white-box access to the teacher's architecture.
Relation-based: student learns relationships between examples, the structure of the teacher's embedding space. Useful for metric learning and cross-modal transfer.
Before Hinton, model compression was largely about pruning, quantization, and architectural redesign — mechanical shrinkage. The distillation paper proposed something different: the large model itself is the curriculum. Instead of discarding the teacher's knowledge, you use it as a richer training signal for the student.
The key observation: when a well-trained model predicts a cat image, it doesn't just say "cat: 99%". It says something like "cat: 99%, leopard: 0.7%, dog: 0.2%, tiger: 0.05%". Those tiny probabilities on "wrong" labels encode the model's understanding that cats are more similar to leopards than to cars. That structure is gold — and a hard label "cat: 1, everything else: 0" throws it all away.
The "dark knowledge" concept: Hinton called the information encoded in near-zero probabilities dark knowledge. A model trained on MNIST might assign 10⁻⁶ probability to "2" when the input is clearly a "3" — but that tiny value still encodes the fact that 2 and 3 are more similar than 2 and 7. Over millions of examples, this structure teaches the student far more than binary labels ever could.
The T² factor is subtle but important. When you divide logits by T before softmax, the gradient of the soft loss is scaled down by 1/T². Multiplying by T² cancels this out — ensuring the soft and hard losses contribute at comparable magnitudes regardless of temperature. Hinton specifically calls this out and it's often missed in implementations.
You could minimize MSE between teacher and student logits directly. But KL divergence treats the teacher's distribution as a probability distribution and measures how much information the student loses. It asymmetrically penalizes the student for assigning low probability where the teacher assigned high probability — which is exactly what you want. MSE treats all errors equally regardless of probability mass, making it a worse fit for this task.
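Here's a minimal PyTorch sketch of the combined objective, with the T² correction applied to the soft term (the function name and the α and T defaults are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style loss: alpha * soft KL term + (1 - alpha) * hard CE term.

    student_logits, teacher_logits: (batch, num_classes); labels: (batch,).
    """
    # Soften both distributions with the same temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)

    # KL(teacher || student), scaled by T^2 so its gradients stay comparable
    # to the hard loss at any temperature (the correction Hinton calls out).
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)

    # Standard cross-entropy against ground-truth labels, computed at T = 1.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```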
The softmax function converts raw logits into probabilities. Normally you compute exp(z_i) / Σ exp(z_j). Temperature scaling adds one parameter T by dividing each logit by T first, exp(z_i / T) / Σ exp(z_j / T), which controls how "peaked" or "flat" the resulting distribution is:
[Figure: a teacher's logit distribution and its softmax output at temperature T, swept from T=0.2 (near hard) to T=10 (near uniform).]
Notice how at high T, the model reveals its beliefs about "cat vs leopard" similarity. At T=1, that information is almost entirely hidden in near-zero probability mass.
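A few lines of numpy make this concrete; the logit values below are invented for illustration:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by T before exponentiating."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([9.0, 4.2, 3.1, 1.0])   # cat, leopard, dog, car (made up)
for T in (0.2, 1.0, 4.0, 10.0):
    print(f"T={T:>4}: {np.round(softmax_T(logits, T), 4)}")
# At T=0.2 essentially all mass sits on "cat"; by T=4-10 the
# cat-vs-leopard similarity is clearly visible in the distribution.
```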
In practice, the teacher and student both use the same temperature T during distillation training. At inference time, you set T = 1.0 for the deployed student model — the temperature was only a training tool to enrich the supervision signal.
Too low (T < 1): barely different from hard labels. The soft distribution is still very peaked around the argmax. Too high (T > 20): the distribution becomes so uniform it's nearly random noise — no useful signal. T = 3–5 is the standard starting point for most LLM distillation tasks. For tasks with high semantic overlap between classes (like token-level language modeling where many tokens are nearly equivalent), higher T (6–10) often works better.
The student is trained to match the teacher's output layer predictions — logits or probabilities. This is the original Hinton formulation and remains the most common form, especially for LLMs where "output" means the full vocabulary distribution at each position.
Instead of just matching the final output, the student is trained to match the teacher's intermediate hidden representations. Each transformer layer produces a hidden state tensor. Feature distillation adds auxiliary losses that penalize differences between corresponding hidden states in teacher and student.
This requires a white-box teacher — you need access to internal activations. For GPT-4 or Claude, this is impossible. For open models like Llama or Mistral, it's entirely feasible. The tradeoff: richer signal at the cost of requiring teacher internals and often needing projection layers (since teacher and student hidden dimensions differ).
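A sketch of what such an auxiliary loss can look like, with a learned projection to bridge the dimension gap; the hidden sizes (768 student, 4096 teacher) are hypothetical:

```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    """Match a student hidden state to a teacher hidden state via projection."""

    def __init__(self, d_student=768, d_teacher=4096):
        super().__init__()
        # The projection is needed because the hidden dimensions differ.
        self.proj = nn.Linear(d_student, d_teacher)
        self.mse = nn.MSELoss()

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: (batch, seq, d_student)
        # teacher_hidden: (batch, seq, d_teacher), detached: no teacher grads
        return self.mse(self.proj(student_hidden), teacher_hidden.detach())
```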
A variant of feature distillation: instead of matching hidden state activations, match the attention weight matrices across layers. The teacher's attention patterns encode which tokens it decided mattered for each prediction. Transferring these patterns teaches the student where to look, not just what to output. Used in TinyBERT and MiniLM; often more effective than hidden state matching for downstream task performance.
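The attention-transfer term can be sketched roughly like this; `layer_map`, the pairing between student and teacher layers, is an assumption you would tune to the two architectures:

```python
import torch.nn.functional as F

def attention_transfer_loss(student_attns, teacher_attns, layer_map):
    """TinyBERT-style attention matching, summed over mapped layer pairs.

    student_attns[i], teacher_attns[j]: (batch, heads, seq, seq) attention
    probabilities. Assumes matching head counts; real setups map or average
    heads when they differ.
    """
    return sum(
        F.mse_loss(student_attns[s], teacher_attns[t].detach())
        for s, t in layer_map
    )

# e.g. a 4-layer student supervised by layers 3, 6, 9, 12 of a 12-layer teacher:
# loss = attention_transfer_loss(s_attns, t_attns, [(0, 2), (1, 5), (2, 8), (3, 11)])
```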
Instead of matching individual outputs or activations, relation-based distillation matches relationships between examples. For a batch of N examples, you compute an N×N similarity matrix from both teacher and student embeddings and train the student to reproduce the teacher's similarity structure.
This is especially powerful for embedding models and retrieval tasks — it teaches the student to place similar concepts near each other in representation space, even if the absolute coordinates differ. Used extensively in contrastive learning and cross-modal distillation (e.g. distilling CLIP into smaller vision encoders).
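A sketch of a relation-based loss over one batch; note that the teacher and student embedding dimensions never need to match, since only the N×N similarity matrices are compared:

```python
import torch
import torch.nn.functional as F

def relation_distill_loss(student_emb, teacher_emb):
    """Match pairwise cosine-similarity structure across a batch.

    student_emb: (N, d_s), teacher_emb: (N, d_t); d_s and d_t may differ.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_s = s @ s.T                      # (N, N) student similarity structure
    sim_t = t @ t.T                      # (N, N) teacher similarity structure
    return F.mse_loss(sim_s, sim_t.detach())
```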
Classic Hinton distillation was designed for classification — a fixed-size logit vector. LLMs generate sequences — the output grows token by token, and there's no natural "final output" to match. This creates new challenges, new loss formulations, and new techniques specific to the LLM regime.
Token-level KD (the straightforward extension): for each position in the sequence, compute KL divergence between teacher and student token distributions. This is efficient and directly analogous to Hinton's original method. It's what most open-source distillation implementations do.
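A sketch of token-level KD with padding masked out (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def token_level_kd(student_logits, teacher_logits, attention_mask, T=2.0):
    """Per-position KL(teacher || student) over the vocabulary.

    Logits: (batch, seq, vocab); attention_mask: (batch, seq), 1 = real token.
    """
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # KL per position: sum over the vocabulary of p_t * (log p_t - log p_s).
    kl = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(-1)   # (batch, seq)
    kl = kl * attention_mask                                     # drop padding
    return (T ** 2) * kl.sum() / attention_mask.sum()
```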
Sequence-level KD (SeqKD, Kim & Rush 2016): instead of matching token distributions at each step, first generate complete sequences from the teacher, then train the student on those generated sequences using standard cross-entropy. Simpler — no need to run teacher and student simultaneously — but loses the soft distribution signal, only transferring the teacher's greedy decodes.
There's a subtle but important asymmetry. Forward KL (teacher ‖ student) is mass-covering: it forces the student to spread probability over every mode the teacher covers, even modes the student can't model well. Reverse KL (student ‖ teacher) is mode-seeking: the student concentrates its mass on the most probable modes and drops the rest.
MiniLLM (Gu et al., 2023) showed that minimizing reverse KL for text generation produces better quality than forward KL, because reverse KL avoids the problem of the student assigning high probability to long-tail sequences the teacher would never produce. The student learns to be sharp and precise rather than trying (and failing) to be comprehensive.
Forward vs Reverse KL Intuition: Imagine the teacher distribution has two modes, "The cat sat" and "A feline rested". Forward KL forces the student to cover both modes, splitting probability mass and producing blurry, averaged outputs. Reverse KL lets the student pick one mode and be very confident about it. For text generation quality, concentrated confidence beats diluted coverage. MiniLLM reported win rates 2–10 points higher than standard forward-KL distillation across benchmarks.
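The two divergences side by side, as pointwise sketches. MiniLLM itself optimizes reverse KL at the sequence level with policy-gradient methods; this only shows the per-distribution intuition:

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    """KL(p || q): mass-covering. Penalizes the student wherever the teacher
    has probability mass, even on modes the student cannot fit."""
    p = F.softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (p.clamp_min(1e-9).log() - log_q)).sum(-1).mean()

def reverse_kl(teacher_logits, student_logits):
    """KL(q || p): mode-seeking. The student may drop low-probability modes,
    but is heavily penalized for putting mass where the teacher has none."""
    q = F.softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (q.clamp_min(1e-9).log() - log_p)).sum(-1).mean()
```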
The most powerful capability of large models isn't their one-shot answers — it's their ability to reason through problems step by step. Chain-of-thought (CoT) distillation transfers this reasoning ability to smaller models by including the teacher's reasoning chains in the training data.
The pipeline: run the teacher (GPT-4, Claude Opus) on reasoning tasks with CoT prompting. Collect the full reasoning traces — not just final answers. Fine-tune the student on (problem, reasoning trace + answer) pairs. The student learns not just what the answer is, but how to get there.
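A sketch of the data-construction step; `query_teacher` is a hypothetical stand-in for whatever API client you use, and the prompt wording is illustrative:

```python
def build_cot_example(problem: str, query_teacher) -> dict:
    """Turn one problem into a (prompt, completion) fine-tuning example."""
    prompt = f"Solve step by step, then state the final answer.\n\nProblem: {problem}"
    completion = query_teacher(prompt)   # full reasoning trace + final answer
    return {
        "prompt": problem,
        # Train on the entire trace, not just the final answer: the
        # reasoning path is the signal being distilled.
        "completion": completion,
    }
```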
DeepSeek-R1 (2025) is the most dramatic demonstration of CoT distillation to date. The full R1 model uses GRPO reinforcement learning to develop long chain-of-thought reasoning. Then, 800K high-quality CoT examples from R1 are used to fine-tune smaller open-source models (Llama-3.1-8B, Qwen-2.5-7B, etc.). The resulting distilled models match or beat GPT-4o on math and coding benchmarks at 8B parameters — a compression ratio that would have seemed impossible two years ago. The key: the teacher's reasoning chains are so rich that the student doesn't need to rediscover how to reason from scratch.
| Approach | Teacher Access Needed | Signal Quality | Examples |
|---|---|---|---|
| SeqKD / Data Augmentation | Outputs (API access) | Low — hard labels only | Alpaca, WizardLM, Orca |
| Token-Level KD | Output logits | Medium — soft labels | DistilBERT, TinyLLaMA |
| Feature KD | Internal activations | High — representations | TinyBERT, BERT-PKD |
| CoT Distillation | Outputs (API access) | Very High — reasoning traces | DeepSeek-R1-Distill, Phi-3 |
| Online KD | Both models active simultaneously | Highest — dynamic alignment | Research setting |
In the combined loss L = α · L_soft + (1 − α) · L_hard: α = 0 → standard supervised training (no distillation). α = 1 → pure teacher imitation (no ground truth). α = 0.5–0.9 → typical distillation range.
DistiLLM (Ko et al., 2024) unifies multiple loss functions under one framework and proposes a novel "skew" KL divergence that sits between forward and reverse KL, combining their strengths:
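A sketch of the skewed forward direction, roughly of the form the paper describes: a little of the teacher's own distribution is mixed into the comparison target, which bounds the divergence even where the student assigns near-zero mass (the parameter name `lam` and its value are illustrative):

```python
import torch.nn.functional as F

def skew_forward_kl(teacher_logits, student_logits, lam=0.1):
    """Skewed forward KL: KL(p || lam * p + (1 - lam) * q).

    Because the target mixture always contains lam * p, the log-ratio is
    bounded by log(1 / lam), avoiding forward KL's blow-up on zero-mass
    student regions while keeping its mass-covering character.
    """
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = lam * p + (1 - lam) * q
    return (p * (p.clamp_min(1e-9).log() - mix.clamp_min(1e-9).log())).sum(-1).mean()
```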
This connection is underappreciated. In speculative decoding, a small draft model proposes tokens, and a large verification model accepts or rejects them. If you train the draft model to maximize the verifier's acceptance rate, you're essentially doing reverse-KL distillation: the draft learns to match the distribution the verifier would accept.
Models like Medusa and Eagle make this explicit: they train draft heads using the large model's hidden states as supervision. The boundary between "distillation" and "speculative decoding training" is almost nonexistent once you look at the loss functions.
An emerging technique: have a model distill from itself. Take a model's high-temperature samples (exploratory, diverse outputs) and use them as training signal for the same model at low temperature (more deterministic). This iterative self-improvement tightens the model's distribution without any external teacher. Used in LLM constitutional AI pipelines and increasingly in RLHF alternatives.
[Chart: % of teacher benchmark score retained (Y-axis) vs. compression ratio, teacher size / student size (X-axis); bubble size = student parameter count.]
| Model | Teacher | Student Size | Compression | Distillation Type | Key Result |
|---|---|---|---|---|---|
| DistilBERT | BERT-Base (110M) | 66M | 1.67× | Response + Feature | 97% of BERT on GLUE, 60% faster |
| TinyBERT | BERT-Base (110M) | 14.5M | 7.5× | Attention + Feature | 96.8% of BERT-Base, 9.4× faster |
| MiniLM | BERT-Large | 22M | 16× | Attention relation | 99%+ on MNLI, SQuAD |
| TinyLLaMA | Llama-2 (7B) | 1.1B | 6.4× | SeqKD + token-KD | Competitive on commonsense benchmarks |
| Phi-1 | GPT-3.5 (API) | 1.3B | ~100× | CoT / Data Distillation | 50.6% HumanEval — matches Codex |
| Phi-2 | Multiple (API) | 2.7B | ~25× | Curated synthetic data | Beats Llama-2-13B on most benchmarks |
| DeepSeek-R1-Distill-8B | DeepSeek-R1 (671B) | 8B | 84× | CoT reasoning traces | Matches o1-mini on MATH-500 |
| DeepSeek-R1-Distill-70B | DeepSeek-R1 (671B) | 70B | 9.6× | CoT reasoning traces | 85.0% MATH-500, beats GPT-4o |
DeepSeek-R1 deserves special attention because it represents a qualitative jump in what distillation can achieve. The recipe: train the full 671B model with GRPO reinforcement learning until it develops long chain-of-thought reasoning, curate roughly 800K high-quality CoT traces from it, then transfer that reasoning into much smaller open models with plain supervised fine-tuning.
Why these numbers are extraordinary: DeepSeek-R1-Distill-8B achieves 72.6% on AIME 2024 and 89.5% on MATH-500. OpenAI's o1-mini achieves 70% on AIME and 90% on MATH-500. A 671B teacher just transferred near-frontier reasoning ability into an 8B model — an 84× compression — via pure supervised fine-tuning on CoT data, no RL required. This is the most significant distillation result in the history of the field.
The key insight: reasoning ability is highly transferable via examples once you have a teacher that can demonstrate it. The student doesn't need to rediscover how to do chain-of-thought — it learns the pattern from examples the same way humans learn proof techniques from worked examples. The bottleneck was never the student's capacity; it was never having a teacher good enough to demonstrate reasoning in the first place.
Phi-1 (1.3B) showed that 7B tokens of textbook-quality synthetic data beat 1T tokens of web scrapes for coding. Distillation that curates 100K excellent teacher examples consistently beats distillation on 10M mediocre ones. The teacher's quality is the ceiling; curation determines how close you get to it.
Giving the student a finished answer is far less valuable than giving it the reasoning path. Models trained on reasoning traces show strong generalization to novel problems in the same domain — they've learned a process, not just an answer bank. This doesn't work with black-box label distillation.
Empirically, a well-distilled student at N parameters tends to match a non-distilled model at 4N parameters. This varies widely by task and distillation quality, but is a useful starting point when deciding whether distillation is worth the engineering investment vs just training a larger student from scratch.
Distillation works best when teacher and student see the same or very similar training distribution. Using GPT-4 as a math teacher for a student trained on general web text transfers poorly — the student lacks the base representations to absorb the teacher's reasoning signals. Domain alignment first, distillation second.
Too large a teacher-student gap hurts. A student below roughly 0.1× of the teacher's capacity struggles to absorb the full distribution; the KL divergence loss becomes dominated by probability mass the student cannot express. A 0.1×–0.3× capacity ratio is the empirical sweet spot for feature distillation; CoT distillation tolerates larger gaps.
Distillation → quantization is a natural pipeline. Distillation gives you a small model. Quantization makes it fast. Distillation after quantization (QAT-KD) — training the quantized student to recover quality from the full-precision teacher — often recovers most of the accuracy lost from int8/int4 quantization.
Train on API-generated outputs (SeqKD). Get GPT-4 or Claude to generate high-quality responses for your task. Fine-tune your student on those. Not distillation in the strict sense — but often the most practical option.
CoT distillation is the current answer. Use a frontier model (R1, Claude 3.7, Gemini 2.5 Pro) with extended thinking to generate reasoning traces on your task domain. Fine-tune on (problem, trace, answer) triplets. Expect +10–30% on reasoning benchmarks vs standard fine-tuning.
Use token-level KD — log-probability matching at each position. Combine with feature distillation if you can afford the memory. Run teacher and student on the same GPU with mixed precision and gradient checkpointing.
Distillation + quantization pipeline. Distill to 1–3B first. Then QLoRA + QAT-KD to recover accuracy after int4 quantization. Export to GGUF Q4_K_M for llama.cpp. Target 500MB–1GB total model size.
My take: Distillation has moved from a compression technique to a capability transfer technique. The DeepSeek-R1 result changes the calculus entirely — you don't need to train a reasoning model from scratch with expensive RL. You need a frontier teacher that can reason, and enough high-quality examples of that reasoning to fine-tune a small model that mimics it. The next five years will be about systematically mapping what capabilities transfer efficiently and what requires the student to have genuine independent capacity. My bet: factual recall transfers poorly (the student needs its own parametric memory). Reasoning patterns transfer extremely well (the student learns algorithms, not facts).