Why does any of this matter?
ChatGPT, Claude, Gemini — these aren't programs with rules. They learned everything they know from data. Understanding how that learning works helps you build with AI more intelligently, evaluate it more honestly, and predict what it can and can't do.
This guide walks through every major concept in plain English — from what a "parameter" actually is, to why training needs a GPU cluster, to the specific technique (DPO) that makes a raw model safe and helpful.
Billions of Dials
An LLM is a machine with billions of adjustable numbers called parameters. Training means turning those dials until predictions get very, very good.
Learned from Data
No rules were written. The model reads trillions of words and discovers language, facts, and reasoning entirely on its own through prediction.
Gradient Descent
The core loop — predict, measure error, adjust. Repeat billions of times. Every technique in this guide builds on this single idea.
Three Stages
Pre-training builds raw intelligence. Fine-tuning shapes it into an assistant. Alignment makes it genuinely helpful and safe for real users.
GPU-Powered
Training needs parallel math at massive scale. GPUs run thousands of operations simultaneously — making them 100× faster than CPUs for AI workloads.
Efficient Fine-tuning
LoRA lets anyone fine-tune a 70B model on a single 48 GB GPU — by training only ~0.1% of the parameters while freezing the rest.
What is an LLM?
LLM stands for Large Language Model. At its core, it's a massive mathematical system — billions of numbers called parameters or weights — that learned to predict and generate text one word at a time. These aren't rules someone wrote. They're values the model discovered on its own by reading enormous amounts of text.
Imagine a chef who has read every cookbook ever written — millions of recipes. When you say "warm, Italian, comforting, pasta…" they complete the dish from memory. They don't look it up. They just know from patterns absorbed over a lifetime of reading.
An LLM is that chef, but for words, code, and ideas. It has read practically the entire internet.
Pre-training → Fine-tuning → Alignment
A raw LLM doesn't start as an assistant. It goes through three distinct training phases — each building on the last — before it's ready to talk to users.
| Stage | What happens | Cost / Time | Result |
|---|---|---|---|
| Pre-training | Read trillions of words. Predict next token. Learn language, facts, reasoning from scratch. | $10M–$100M · Months | Smart but raw — generates text, doesn't behave helpfully |
| Fine-tuning (SFT) | Train on expert-written (question, great answer) pairs. Model learns to be an assistant. | Cheap · Days | Conversational assistant that follows instructions |
| Alignment (DPO) | Show (better response, worse response) pairs. Train model to consistently prefer the better one. | Cheap · Days | Helpful, honest, safe — ready for production |
Pre-training = years of general fitness. You become physically capable of anything.
Fine-tuning = sport-specific training. The swimmer trains for swimming, not everything.
Alignment = learning the rules and sportsmanship. Same body, now competing in one specific game.
Gradient Descent
Every time the model makes a prediction during training, something watches and asks: "How wrong were you?" That measurement is called the loss. The entire goal of training is to make the loss as small as possible — across billions of predictions.
You're blindfolded in a hilly landscape, trying to reach the lowest valley (lowest loss). You can only feel the slope under your feet. Each step, you move in whichever direction feels downhill.
That's gradient descent. Follow the slope, one step at a time, until you reach the bottom. The billions of training steps are the billions of footsteps toward the valley.
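The loop above can be sketched in a few lines. This is a hypothetical one-dimensional toy with a made-up loss of (w − 3)²; real training applies the same update rule to billions of weights at once.

```python
# Toy gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3.
# Illustrative only: real training minimizes next-token prediction loss.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # the slope "under your feet" at w

w = 0.0    # start somewhere on the hillside
lr = 0.1   # learning rate: the size of each step

for step in range(100):
    w -= lr * grad(w)  # step in the downhill direction

print(w)  # very close to 3.0, the bottom of the valley
```

Try changing `lr` to 1.5: the walker overshoots the valley on every step and the loss explodes, which is exactly the tuning-peg failure described in the next section.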
Learning Rate
After backpropagation tells the model which direction to nudge each weight, the learning rate decides how far to nudge it. It's a tiny number — often 0.00003 — but it has an outsized effect on whether training succeeds or collapses entirely.
The learning rate is how much you turn the tuning peg. Too little — you'll be there all day, barely changing the pitch. Too much — you overshoot sharp, then flat, then snap the string entirely. The right amount gets the string perfectly in tune, smoothly and quickly.
Learning Rate Warmup
At the very start of training, the model's internal signals (gradients) are large and chaotic. A high learning rate during this fragile phase can destroy the model before it learns anything useful. Warmup solves this by starting the learning rate near zero and gradually raising it over the first few hundred steps.
On a cold morning, you don't floor the accelerator the moment you turn the key. The engine needs a minute to reach operating temperature before you can push it hard.
Neural networks are the same. The training signals need a moment to stabilize before the full learning rate kicks in.
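A common shape is linear warmup followed by cosine decay, the same schedule the pipeline section later describes. A minimal sketch; the step counts and peak rate here are illustrative, not from any particular training run:

```python
import math

def lr_schedule(step, max_lr=3e-5, warmup_steps=500, total_steps=10_000):
    """Linear warmup from 0 to max_lr, then cosine decay back toward 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # gentle ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))  # smooth decay

print(lr_schedule(0))       # 0.0 — a cold engine gets no throttle
print(lr_schedule(500))     # 3e-05 — full speed once warmed up
print(lr_schedule(10_000))  # 0.0 — decayed away by the end of training
```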
DPO — Direct Preference Optimization
After pre-training, a model is smart but cold. Ask it "I'm feeling sad" and it might output a Wikipedia entry on clinical depression. Technically relevant. Not what you wanted. Alignment fixes this — instead of writing rules, we show the model thousands of (better response, worse response) pairs and train it to prefer the better ones.
Prompt: "I'm feeling really sad today."
Preferred response: "I'm sorry to hear that. Do you want to talk about what's going on? Sometimes just putting it into words helps."
Rejected response: "Sadness is a normal human emotion caused by neurotransmitter imbalances and situational stressors."
A great teacher shows you two essays — one excellent, one mediocre — and says "this one is better." After hundreds of examples, you develop taste. DPO is that process at scale: thousands of preference pairs, training the model's judgment directly.
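The DPO loss itself is compact. A per-pair sketch, assuming you already have log-probabilities from the model being trained and from a frozen reference copy (variable names are mine, not from any particular library):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are the log-probabilities
    that the trained model and a frozen reference copy assign to the
    chosen (better) and rejected (worse) responses."""
    chosen_margin = logp_chosen - ref_logp_chosen        # improvement on the good one
    rejected_margin = logp_rejected - ref_logp_rejected  # drift on the bad one
    # Reward widening the gap between better and worse, scaled by beta.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))    # -log(sigmoid(logits))

# The model now prefers the chosen response more than the reference did:
print(dpo_loss(-5.0, -9.0, -6.0, -8.0))  # below log(2) ~ 0.693, i.e. learning
```

With no preference learned at all, the loss sits at log(2); pushing the chosen response up and the rejected one down drives it toward zero. `beta` controls how far the model may drift from the reference while doing so.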
DPO vs RLHF — What Changed
| Feature | RLHF (old way) | DPO (new way) |
|---|---|---|
| Mechanism | Train a separate reward model, then optimize against it | Directly train on preference pairs — no judge model needed |
| Complexity | Two stages, two models | One stage, one model |
| Stability | Prone to instability | Stable and predictable |
| Used by | Early OpenAI, early Anthropic | Most labs since 2023 |
LoRA & QLoRA
Fully fine-tuning a 70B model updates all 70 billion weights — in BF16 the weights alone occupy ~140 GB, before gradients and optimizer state are counted. LoRA instead freezes the base weights and trains tiny adapter matrices (~0.1% as many parameters), shrinking the trainable footprint — and its optimizer memory — dramatically. QLoRA adds 4-bit compression of the frozen base model, bringing 70B fine-tuning within reach of a single 48 GB GPU.
Instead of rewriting an entire textbook, you add tiny Post-It notes in the margins. The original pages stay untouched. LoRA works the same way — tiny adapters modify the model's behavior without touching the base weights.
At the end, the notes merge back in seamlessly. Nobody can tell the difference.
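A tiny sketch of the idea, using toy dimensions and plain Python lists rather than a real tensor library. The frozen weight matrix `W` is never touched, and the adapter pair `(A, B)` starts as a no-op because `B` is initialized to zero:

```python
import random
random.seed(0)

d, r = 8, 2  # toy hidden size and LoRA rank (real models: d in the thousands)

# Frozen base weight W (d x d): never updated during fine-tuning.
W = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]

# Trainable adapters: B starts at zero, so fine-tuning begins as a no-op.
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # r x d
B = [[0.0] * r for _ in range(d)]                                 # d x r

def matvec(M, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in M]

def forward(x):
    # Effective weight is W + B @ A, computed without ever modifying W.
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]

x = [1.0] * d
print(forward(x) == matvec(W, x))  # True: zero-initialized B changes nothing yet
```

Here only 2·r·d = 32 adapter values are trainable versus d² = 64 frozen base weights; the same ratio at 70B scale is what makes single-GPU fine-tuning possible. After training, B·A can be added into W once, so inference pays no extra cost — the Post-It notes merging back into the page.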
The Supporting Cast
AdamW — The Smart Optimizer
Gradient descent says "move this way." AdamW gives each individual weight its own adaptive step size — weights that barely change get bigger nudges; fast-changing ones get smaller nudges. The "W" means weight decay: gentle pressure keeping weights small and preventing overfitting. Like a personal trainer adjusting each athlete's load individually instead of giving everyone the same workout.
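For one weight, the update looks roughly like this. A single-weight sketch with illustrative hyperparameters; real optimizers vectorize this across billions of weights at once, and the β values match the pre-training settings quoted later in this guide:

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.95,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single weight w with gradient g at step t."""
    m = b1 * m + (1 - b1) * g          # running average of gradients (momentum)
    v = b2 * v + (1 - b2) * g * g      # running average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-weight adaptive nudge
    w = w - lr * weight_decay * w      # the "W": decoupled weight decay
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):                # 100 steps with a constant gradient
    w, m, v = adamw_step(w, 0.5, m, v, t)
print(w)  # steadily pushed below 1.0
```

The division by the root of `v_hat` is the "personal trainer": a weight whose gradients are consistently large gets its steps shrunk, while a weight with small gradients keeps relatively bigger ones.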
Dropout — Preventing Memorization
Dropout randomly switches off 10% of neurons each training step, so the model can't over-rely on any single path. It's forced to learn general, robust patterns instead of memorizing specific examples. Like a student who can't use memorized sentences on the exam and has to actually understand the material.
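A sketch of standard "inverted" dropout, framework-agnostic, with the 10% rate from the text:

```python
import random
random.seed(0)  # deterministic for the demo

def dropout(activations, p=0.1, training=True):
    """Zero each activation with probability p during training, scaling
    survivors by 1/(1-p) so the expected value is unchanged.
    At inference time (training=False), pass everything through."""
    if not training:
        return list(activations)
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

out = dropout([1.0] * 10)  # on average, about one of the ten is zeroed
```

The 1/(1−p) rescaling is what lets the network run at inference with dropout simply switched off, no correction needed.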
BF16 — Free Speed
BF16 stores weights as 16-bit numbers instead of 32-bit — halving memory and speeding training 2–4× on modern GPUs with almost no quality loss. Nearly all LLM training today uses BF16. It's preferred over FP16 because it handles the extreme values that neural networks produce more gracefully.
Batch Size & Gradient Accumulation
Instead of updating weights after every single example, the model processes a batch of 32–256 at once and averages their gradients. When GPU memory is limited, gradient accumulation simulates a large batch: run 8 small passes, accumulate the gradients, then update once — same result, no extra memory required.
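The "same result" claim is easy to demonstrate. A hypothetical toy where each example's "gradient" is just its value; a real backward pass would compute this from the loss:

```python
def grad_of(example):
    return example  # stand-in for a real backward pass

examples = [float(i) for i in range(32)]

# One big batch of 32: average all the gradients, update once.
big_batch_grad = sum(grad_of(e) for e in examples) / len(examples)

# Gradient accumulation: 8 micro-batches of 4, summing as we go.
accumulated = 0.0
for i in range(0, len(examples), 4):
    micro_batch = examples[i:i + 4]
    accumulated += sum(grad_of(e) for e in micro_batch)
accumulated_grad = accumulated / len(examples)

print(big_batch_grad == accumulated_grad)  # True: identical update, less memory
```

Only one micro-batch's activations live in memory at a time, which is the entire trick.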
GPUs — Why Training Needs Special Hardware
Training involves multiplying enormous grids of numbers together, over and over. A CPU works through these mostly one at a time — each operation very fast, but sequential. A GPU has thousands of smaller cores that do them all simultaneously — making it 100× faster for this specific type of work.
A CPU is a single brilliant mathematician solving complex problems one at a time. A GPU is a stadium of school kids, each doing one simple multiplication at the exact same moment.
For AI math — multiplying millions of numbers in parallel — the stadium wins by 100×.
VRAM — The Real Bottleneck
VRAM is the GPU's dedicated on-board memory. A 7B model in BF16 needs ~14 GB just to load — before training overhead. A 70B model needs ~140 GB. When people say "my GPU ran out of memory," they mean VRAM. This is exactly why LoRA and quantization exist — to squeeze models into the VRAM you actually have.
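The back-of-the-envelope arithmetic behind those numbers is worth scripting once. Weights only; activations, KV cache, and optimizer state all add more on top:

```python
def weights_vram_gb(params_billion, bytes_per_weight=2):
    """GB of VRAM just to hold the weights.
    BF16 = 2 bytes/weight; 4-bit quantization = 0.5 bytes/weight."""
    return params_billion * 1e9 * bytes_per_weight / 1e9

print(weights_vram_gb(7))                         # 14.0 GB for a 7B model
print(weights_vram_gb(70))                        # 140.0 GB for a 70B model
print(weights_vram_gb(70, bytes_per_weight=0.5))  # 35.0 GB at 4-bit
```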
Key Insights from LLM Training
Data quality beats quantity
A model trained on 2T clean tokens often outperforms one trained on 10T noisy tokens. Curation and filtering matter more than raw scale.
The tokenizer is architecture
A well-designed tokenizer (1–2 tokens per word vs 4–8) gives the model more signal per parameter and dramatically improves training efficiency.
Alignment is learnable taste
Safety and helpfulness aren't rules you write. They're preferences the model learns from examples — and DPO has made this process remarkably stable.
LoRA democratizes fine-tuning
A developer with a single 48 GB GPU can now fine-tune a 70B model. The power gap between labs and individuals has never been smaller.
Warmup shape beats LR value
A well-shaped warmup + cosine decay schedule is more robust than a precisely-tuned static learning rate. The schedule is the real knob.
RL is the new frontier
Reinforcement learning during post-training — GRPO and variants — is where the biggest capability gains are happening right now.
The Modern Training Pipeline
From raw data to production deployment — step by step
Collect & Clean Data
Data Engineering · Scrape the web, books, code, and papers. Filter garbage. Deduplicate aggressively. Data quality matters more than almost anything else.
- Web scraping and Common Crawl processing
- Language detection and quality filtering
- Deduplication using MinHash or exact matching
- Toxic content filtering and removal
- Custom data mixtures tuned via ablations
Pre-train on a GPU Cluster
Pre-training · Weeks to Months · Predict the next word, billions of times, across thousands of GPUs. The model reads everything and learns language, reasoning, and world knowledge from prediction alone.
- AdamW optimizer with β1=0.9, β2=0.95
- BF16 mixed precision throughout
- LR warmup (1–5%) → stable → cosine decay
- Gradient clipping to prevent explosion
- Distributed across 1,000–10,000 H100s
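One bullet above, gradient clipping, is simple enough to sketch. This is the global-norm variant commonly used in LLM training; the `max_norm` value here is illustrative:

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    """If the overall gradient vector is longer than max_norm, scale
    every component down proportionally; otherwise pass it through."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)

print(clip_grad_norm([3.0, 4.0]))  # norm 5 scaled down to norm 1
print(clip_grad_norm([0.1, 0.1]))  # small gradient passes through unchanged
```

The direction of the step is preserved; only a runaway magnitude is reined in, which is what prevents one bad batch from exploding the loss.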
Supervised Fine-Tuning (SFT)
Fine-tuning · Days · Train on thousands of high-quality (prompt, ideal response) pairs. The model learns to follow instructions. Usually uses LoRA to save compute.
- Human-written (instruction, response) pairs
- Diverse tasks: summarization, coding, reasoning
- LoRA adapters trained at rank 16–64
- Lower learning rate than pre-training
- 1–3 epochs maximum to avoid overfitting
Alignment via DPO or RLHF
Alignment · Days · Collect human preference data. Apply DPO to make the model lean toward genuinely helpful, honest, safe answers. Multiple rounds targeting different failure modes.
- Human annotators rank response pairs
- DPO loss applied on preference data directly
- Multiple rounds targeting different failure modes
- Safety evaluations between each round
- Some labs use GRPO/PPO for reasoning models
Evaluate & Red-Team
Evaluation · Ongoing · Run through thousands of benchmarks and adversarial prompts. Find failure modes. Fix with targeted fine-tuning. Repeat until ready.
- MMLU, HellaSwag, TruthfulQA, HumanEval
- Adversarial red-teaming for safety failures
- Human preference A/B tests
- Domain evals: medical, legal, code
- Regression testing vs previous checkpoints
Quantize & Deploy
Inference · Production · Compress weights to 4-bit or 8-bit for fast, cheap inference. The model millions of users interact with is almost always quantized — often to INT8 or NVFP4 on modern hardware.
- INT8 / INT4 quantization for inference speed
- vLLM or SGLang for efficient batched serving
- KV cache management for long contexts
- Speculative decoding for latency reduction
- Served via REST API at scale
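The core of INT8 quantization fits in a few lines. A symmetric per-tensor sketch, assuming a nonzero tensor; real serving stacks use per-channel scales and fused kernels:

```python
def quantize_int8(weights):
    """Map the largest-magnitude weight to 127, everything else in proportion.
    Each quantized value fits in one signed byte instead of 2-4 bytes."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.12, -0.5, 0.33, 1.27]
q, s = quantize_int8(w)
print(q)                 # small integers in [-127, 127]
print(dequantize(q, s))  # close to the originals, tiny rounding error
```

The rounding error per weight is at most half the scale, which is why 8-bit (and, with more care, 4-bit) serving loses almost no quality.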