Machine Learning · Plain English · Complete Guide

The Brain Behind ChatGPT: How AI Actually Learns

From a blank model to production AI — a plain-English deep dive into every concept, technique, and piece of hardware behind the models reshaping the world.

Published March 2026 · Prateek Singh, PhD · ~18 min read

Overview

Why does any of this matter?

ChatGPT, Claude, Gemini — these aren't programs with rules. They learned everything they know from data. Understanding how that learning works helps you build with AI more intelligently, evaluate it more honestly, and predict what it can and can't do.

This guide walks through every major concept in plain English — from what a "parameter" actually is, to why training needs a GPU cluster, to the specific technique (DPO) that makes a raw model safe and helpful.

🧠

Billions of Dials

An LLM is a machine with billions of adjustable numbers called parameters. Training means turning those dials until predictions get very, very good.

📚

Learned from Data

No rules were written. The model reads trillions of words and discovers language, facts, and reasoning entirely on its own through prediction.

🔁

Gradient Descent

The core loop — predict, measure error, adjust. Repeat billions of times. Every technique in this guide builds on this one single idea.

🎯

Three Stages

Pre-training builds raw intelligence. Fine-tuning shapes it into an assistant. Alignment makes it genuinely helpful and safe for real users.

GPU-Powered

Training needs parallel math at massive scale. GPUs run thousands of operations simultaneously — making them 100× faster than CPUs for AI workloads.

🔓

Efficient Fine-tuning

LoRA lets a single GPU fine-tune models that would otherwise need a whole cluster, by training only ~0.1% of the parameters while freezing the rest.

01 — The Basics

What is an LLM?

Large Language Model — the technology inside ChatGPT, Claude, and Gemini
Parameters · Weights · Next-token Prediction

LLM stands for Large Language Model. At its core, it's a massive mathematical system — billions of numbers called parameters or weights — that learned to predict and generate text one word at a time. These aren't rules someone wrote. They're values the model discovered on its own by reading enormous amounts of text.

🍳 Analogy — The Chef

Imagine a chef who has read every cookbook ever written — millions of recipes. When you say "warm, Italian, comforting, pasta…" they complete the dish from memory. They don't look it up. They just know from patterns absorbed over a lifetime of reading.

An LLM is that chef, but for words, code, and ideas. It has read practically the entire internet.

What happens when you type to an AI
You type: "The sky is ___"
Split into tokens
Pass through billions of weights
Processed signal
Score every possible next word
"blue" wins at 82% → Output
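The scoring step above can be shown in a few lines. The logits below are made-up illustrative values, not from a real model; a real LLM scores every token in a vocabulary of ~100,000 the same way:

```python
import math

# Toy next-token scoring for the prompt "The sky is ___".
# The raw scores (logits) here are invented for illustration.
logits = {"blue": 4.0, "clear": 2.3, "falling": 1.1, "green": 0.2}

def softmax(scores):
    # Convert raw scores into probabilities that sum to 1
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
best = max(probs, key=probs.get)   # "blue" gets the highest probability
```

The softmax turns arbitrary scores into a probability distribution, which is why the model can say "blue" wins with a specific percentage rather than just picking a word.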
02 — The Three Stages

Pre-training → Fine-tuning → Alignment

Every modern AI assistant has been through all three of these phases
Pre-training · Fine-tuning · Alignment

A raw LLM doesn't start as an assistant. It goes through three distinct training phases — each building on the last — before it's ready to talk to users.

Pre-training: read trillions of words and predict the next token, learning language, facts, and reasoning from scratch. Cost: $10M–$100M over months. Result: smart but raw — generates text, doesn't behave helpfully.
Fine-tuning (SFT): train on expert-written (question, great answer) pairs so the model learns to be an assistant. Cost: cheap, days. Result: a conversational assistant that follows instructions.
Alignment (DPO): show (better response, worse response) pairs and train the model to consistently prefer the better one. Cost: cheap, days. Result: helpful, honest, safe — ready for production.
🏊 Analogy — The Athlete

Pre-training = years of general fitness. You become physically capable of anything.

Fine-tuning = sport-specific training. The swimmer trains for swimming, not everything.

Alignment = learning the rules and sportsmanship. Same body, now competing in one specific game.

03 — The Core Engine

Gradient Descent

The single idea underneath every AI training technique in the world
Forward Pass · Loss · Backpropagation · Weight Update

Every time the model makes a prediction during training, something watches and asks: "How wrong were you?" That measurement is called the loss. The entire goal of training is to make the loss as small as possible — across billions of predictions.

The Training Loop — runs billions of times
① Feed text in
② Model predicts next word
③ Compare → Loss
④ Backpropagate — trace blame
⑤ Nudge weights to reduce loss
Repeat ↺
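The five-step loop above shrinks to a runnable toy: a one-parameter model y = w·x learns w = 2.0 from a few examples. Real LLMs run exactly this loop over billions of parameters; the numbers here are illustrative:

```python
# (input, target) pairs — the "training data" for our one-dial model
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0            # ① start with a blank weight
lr = 0.05          # learning rate (step size)

for epoch in range(200):
    for x, target in data:
        pred = w * x                     # ② model predicts
        loss = (pred - target) ** 2      # ③ measure how wrong (squared error)
        grad = 2 * (pred - target) * x   # ④ backpropagate: d(loss)/d(w)
        w -= lr * grad                   # ⑤ nudge the weight downhill
# w has converged very close to 2.0
```

One dial, a few hundred steps. An LLM is the same loop with billions of dials and billions of steps.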
🏔️ Analogy — The Blindfolded Hiker

You're blindfolded in a hilly landscape, trying to reach the lowest valley (lowest loss). You can only feel the slope under your feet. Each step, you move in whichever direction feels downhill.

That's gradient descent. Follow the slope, one step at a time, until you reach the bottom. The billions of training steps are the billions of footsteps toward the valley.

04 — Most Important Hyperparameter

Learning Rate

Controls how big each training step is — the most impactful single knob
Hyperparameter · Too high → Divergence · Too low → Stagnation · Typical: 3e-5

After backpropagation tells the model which direction to nudge each weight, the learning rate decides how far to nudge it. It's a tiny number — often 0.00003 — but it has an outsized effect on whether training succeeds or collapses entirely.

🎸 Analogy — Guitar Tuning

The learning rate is how much you turn the tuning peg. Too little — you'll be there all day, barely changing the pitch. Too much — you overshoot sharp, then flat, then snap the string entirely. The right amount gets the string perfectly in tune, smoothly and quickly.
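All three regimes can be reproduced with gradient descent on the simplest possible landscape, f(x) = x², whose minimum is at 0. The step sizes below are illustrative:

```python
# Gradient descent on f(x) = x² with three different learning rates.
def descend(lr, steps=50, x=10.0):
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x² is 2x
    return x

too_low  = descend(0.001)  # barely moves — still far from the minimum
right    = descend(0.1)    # converges smoothly toward 0
too_high = descend(1.1)    # overshoots worse each step — diverges
```

With lr = 1.1 every step flips the sign of x and grows its magnitude, the numerical version of snapping the string.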

05 — Warmup

Learning Rate Warmup

Why you can't start training at full speed from step one
Gradual ramp-up · 1–5% of total steps · Then cosine decay

At the very start of training, the model's internal signals (gradients) are large and chaotic. A high learning rate during this fragile phase can destroy the model before it learns anything useful. Warmup solves this by starting the learning rate near zero and gradually raising it over the first few hundred steps.

Learning Rate Schedule — 3 Phases
Warmup — LR rises slowly · Stable — full-speed training · Cosine Decay — smooth slow-down
🚗 Analogy — Cold Car Engine

On a cold morning, you don't floor the accelerator the moment you turn the key. The engine needs a minute to reach operating temperature before you can push it hard.

Neural networks are the same. The training signals need a moment to stabilize before the full learning rate kicks in.
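The three-phase schedule can be written as one small function of the training step. The step counts and peak learning rate below are typical illustrative choices, not universal values:

```python
import math

# Warmup → stable → cosine decay, as a function of the current step.
def lr_at(step, total_steps=10_000, warmup_steps=300,
          stable_steps=3_000, peak_lr=3e-5):
    if step < warmup_steps:
        # Phase 1: ramp linearly from 0 up to the peak
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Phase 2: hold at full speed
        return peak_lr
    # Phase 3: cosine decay smoothly from peak down to 0
    done = step - warmup_steps - stable_steps
    span = total_steps - warmup_steps - stable_steps
    return peak_lr * 0.5 * (1 + math.cos(math.pi * done / span))
```

The trainer calls this once per step and feeds the result to the optimizer, so the "knob" is really a curve.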

06 — Alignment

DPO — Direct Preference Optimization

How models learn to give answers humans actually want, not just technically correct ones
DPO · RLHF · Chosen vs Rejected · Preference Data

After pre-training, a model is smart but cold. Ask it "I'm feeling sad" and it might output a Wikipedia entry on clinical depression. Technically relevant. Not what you wanted. Alignment fixes this — instead of writing rules, we show the model thousands of (better response, worse response) pairs and train it to prefer the better ones.

✓ Chosen — the better response

Prompt: "I'm feeling really sad today."

"I'm sorry to hear that. Do you want to talk about what's going on? Sometimes just putting it into words helps."

✗ Rejected — the worse response

Prompt: "I'm feeling really sad today."

"Sadness is a normal human emotion caused by neurotransmitter imbalances and situational stressors."

✍️ Analogy — The Writing Teacher

A great teacher shows you two essays — one excellent, one mediocre — and says "this one is better." After hundreds of examples, you develop taste. DPO is that process at scale: thousands of preference pairs, training the model's judgment directly.
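The DPO objective for a single preference pair is compact enough to sketch directly. In practice the log-probabilities come from the model being trained (the "policy") and a frozen copy of it (the "reference"); the numbers below are toy values for illustration:

```python
import math

# DPO loss for one (chosen, rejected) pair:
# loss = -log sigmoid(beta * margin), where the margin measures how much
# more the policy prefers chosen over rejected, relative to the reference.
def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy already leans toward the chosen response → small loss
good = dpo_loss(-5.0, -9.0, ref_chosen=-6.0, ref_rejected=-7.0)
# Policy leans toward the rejected response → larger loss
bad = dpo_loss(-9.0, -5.0, ref_chosen=-6.0, ref_rejected=-7.0)
```

Minimizing this loss pushes the margin up, widening the model's preference for the chosen answer, with no separate reward model anywhere in the loop.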

DPO vs RLHF — What Changed

Mechanism: RLHF (old way) trains a separate reward model, then optimizes against it; DPO (new way) trains directly on preference pairs, with no judge model needed.
Complexity: RLHF takes two stages and two models; DPO takes one stage and one model.
Stability: RLHF is prone to instability; DPO is stable and predictable.
Used by: RLHF powered early OpenAI and early Anthropic models; DPO has been adopted by most labs since 2023.
07 — Efficient Fine-tuning

LoRA & QLoRA

Fine-tuning massive models on a single GPU — without losing quality
LoRA · QLoRA · 0.1% of params · PEFT

Fully fine-tuning a 70B model updates all 70 billion weights: the weights alone occupy ~140 GB in BF16, and gradients plus optimizer states multiply that several times over. LoRA sidesteps this by adding tiny adapter matrices (~0.1% of the parameter count) and training only those, keeping the original weights frozen. QLoRA goes further, compressing the frozen base model to 4-bit, which brings a 70B fine-tune within reach of a single 48 GB GPU and smaller models within reach of a consumer RTX 4090.

📌 Analogy — Post-It Notes in a Textbook

Instead of rewriting an entire textbook, you add tiny Post-It notes in the margins. The original pages stay untouched. LoRA works the same way — tiny adapters modify the model's behavior without touching the base weights.

At the end, the notes merge back in seamlessly. Nobody can tell the difference.
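The Post-It trick is just low-rank matrix addition. The shapes below are tiny for illustration; in a real model W would be something like 4096×4096 and the rank 16–64:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                          # model dimension and LoRA rank
W = rng.normal(size=(d, d))           # frozen base weights — never updated
A = rng.normal(size=(r, d)) * 0.01    # small trainable adapter
B = np.zeros((d, r))                  # starts at zero, so the adapter is a no-op

# During fine-tuning, only A and B receive gradients.
adapter_params = A.size + B.size      # 512 here, vs 4096 in W
full_params = W.size

# After training, the notes "merge back into the textbook":
W_merged = W + B @ A                  # same shape as W — no extra inference cost
```

Because B starts at zero the adapter initially changes nothing, and because B @ A has the same shape as W, the merged model is indistinguishable in structure from the original.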

LoRA vs Full Fine-tuning — what actually gets updated
Full fine-tuning
Update all 70B weights · ~140GB for the weights alone, several times more with optimizer states · Very slow · $$$
LoRA
Freeze 70B weights · Add ~70M adapter params · Train only those
QLoRA
LoRA + compress base to 4-bit (~35GB for 70B) · Fits a single 48GB GPU
08 — Supporting Techniques

The Supporting Cast

Techniques running silently in the background of every single training run
AdamW · Dropout · BF16 · Batch Size

AdamW — The Smart Optimizer

Gradient descent says "move this way." AdamW gives each individual weight its own adaptive step size — weights that barely change get bigger nudges; fast-changing ones get smaller nudges. The "W" means weight decay: gentle pressure keeping weights small and preventing overfitting. Like a personal trainer adjusting each athlete's load individually instead of giving everyone the same workout.
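The per-weight adaptivity comes from two running averages. Here is a one-parameter sketch of the AdamW update; the hyperparameter values are common defaults, and the quadratic it minimizes is just a stand-in for a real loss:

```python
import math

# One-parameter AdamW: adaptive step from running gradient averages (m, v),
# plus decoupled weight decay — the "W".
def adamw_minimize(grad_fn, x=5.0, lr=0.1, beta1=0.9, beta2=0.95,
                   weight_decay=0.01, eps=1e-8, steps=1000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step
        x -= lr * weight_decay * x               # weight decay, applied separately
    return x

x_min = adamw_minimize(lambda x: 2 * x)   # minimize f(x) = x² — ends up near 0
```

Dividing by the root of v_hat is what gives each weight its own effective step size; applying weight decay outside that division is what distinguishes AdamW from plain Adam.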

Dropout — Preventing Memorization

Dropout randomly switches off 10% of neurons each training step, so the model can't over-rely on any single path. It's forced to learn general, robust patterns instead of memorizing specific examples. Like a student who can't use memorized sentences on the exam and has to actually understand the material.
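The mechanism is a random mask plus rescaling ("inverted dropout"), so the expected activation is unchanged. At inference time dropout is switched off entirely:

```python
import random

# Inverted dropout on one activation vector: each value is kept with
# probability 0.9 and scaled by 1/0.9 so the expected sum stays the same.
def dropout(values, p_drop=0.1, seed=42):
    rng = random.Random(seed)   # seeded here only to make the example repeatable
    keep = 1.0 - p_drop
    return [v / keep if rng.random() > p_drop else 0.0 for v in values]

activations = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]
noisy = dropout(activations)    # a few entries zeroed, the rest slightly scaled up
```

A fresh mask is drawn every training step, so no single neuron can become indispensable.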

BF16 — Free Speed

BF16 stores weights as 16-bit numbers instead of 32-bit — halving memory and speeding training 2–4× on modern GPUs with almost no quality loss. Nearly all LLM training today uses BF16. It's preferred over FP16 because it handles the extreme values that neural networks produce more gracefully.
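NumPy has no native bfloat16, but the key contrast is easy to demonstrate with the types it does have: FP16's tiny maximum value versus the FP32 exponent range that BF16 shares:

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)   # 65504.0 — FP16's ceiling
overflow = np.float16(70000.0)               # past the ceiling → infinity
fp32_max = float(np.finfo(np.float32).max)   # ~3.4e38 — BF16 covers this same range
```

A large activation like 70,000 silently becomes infinity in FP16 and poisons everything downstream; BF16 trades precision for range so this never happens, which is why it's the default for LLM training.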

Batch Size & Gradient Accumulation

Instead of updating weights after every single example, the model processes a batch of 32–256 at once and averages their gradients. When GPU memory is limited, gradient accumulation simulates a large batch: run 8 small passes, accumulate the gradients, then update once — same result, no extra memory required.
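The "same result" claim can be checked directly with a toy model: averaging the gradients of 8 micro-batches of 4 reproduces the gradient of one batch of 32. The data and model here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=32)
y = 2.0 * x
w = 0.0   # one-parameter model y = w * x

def grad(w, xb, yb):
    # Mean-squared-error gradient for the batch
    return np.mean(2 * (w * xb - yb) * xb)

big_batch_grad = grad(w, x, y)           # needs all 32 examples in memory at once

accumulated = 0.0
for i in range(8):                       # 8 small forward/backward passes
    xb, yb = x[i*4:(i+1)*4], y[i*4:(i+1)*4]
    accumulated += grad(w, xb, yb) / 8   # accumulate instead of updating
# accumulated now matches big_batch_grad — apply one weight update here
```

Because the micro-batches are equal-sized, the average of their gradients equals the full-batch gradient exactly, which is why frameworks expose this as a simple "accumulation steps" setting.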

09 — Hardware

GPUs — Why Training Needs Special Hardware

A regular computer simply cannot do this — and here's exactly why
Parallel Computing · VRAM · H100 · A100 · 4090

Training involves multiplying enormous grids of numbers together, over and over. A CPU works through these operations a few at a time — very fast, but largely serial. A GPU has thousands of smaller cores that run them all simultaneously, making it roughly 100× faster for this specific type of work.

🏟️ Analogy — Stadium vs Genius

A CPU is a single brilliant mathematician solving complex problems one at a time. A GPU is a stadium of school kids, each doing one simple multiplication at the exact same moment.

For AI math — multiplying millions of numbers in parallel — the stadium wins by 100×.

🖥️
NVIDIA H100
Gold standard. Used by OpenAI, Google, Anthropic. ~$30K each.
NVIDIA A100
Previous generation workhorse. Still widely used for fine-tuning.
💻
RTX 4090
Consumer GPU. Good for LoRA fine-tuning of 7B–13B models. 24GB VRAM.
🌐
GPU Clusters
Large models use 1,000–10,000 GPUs in distributed training setups.

VRAM — The Real Bottleneck

VRAM is the GPU's own dedicated memory, mounted on the graphics card. A 7B model in BF16 needs ~14GB just to load — before training overhead. A 70B model needs ~140GB. When people say "my GPU ran out of memory," they mean VRAM. This is exactly why LoRA and quantization exist — to squeeze models into the VRAM you actually have.
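The back-of-envelope arithmetic behind those numbers is just parameters × bits. A small helper (using 1 GB = 10⁹ bytes for round numbers, and ignoring training overhead) reproduces them:

```python
# VRAM needed just to hold a model's weights — no gradients, optimizer
# states, or activations included.
def weight_vram_gb(n_params_billion, bits_per_weight):
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

seven_b_bf16  = weight_vram_gb(7, 16)    # 14.0 GB — fits a 24GB consumer card
seventy_b_bf16 = weight_vram_gb(70, 16)  # 140.0 GB — needs multiple datacenter GPUs
seventy_b_4bit = weight_vram_gb(70, 4)   # 35.0 GB — why 4-bit quantization matters
```

Halve the bits, halve the VRAM: the entire quantization story in one multiplication.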

10 — What It Tells Us

Key Insights from LLM Training

What the field has learned — and what it tells us about how AI actually works
Design Principles · Lessons
01 — Data quality beats quantity

A model trained on 2T clean tokens often outperforms one trained on 10T noisy tokens. Curation and filtering matter more than raw scale.

02 — The tokenizer is architecture

A well-designed tokenizer (1–2 tokens per word vs 4–8) gives the model more signal per parameter and dramatically improves training efficiency.

03 — Alignment is learnable taste

Safety and helpfulness aren't rules you write. They're preferences the model learns from examples — and DPO has made this process remarkably stable.

04 — LoRA democratizes fine-tuning

A developer with one consumer GPU can now fine-tune models that once demanded a datacenter. The power gap between labs and individuals has never been smaller.

05 — Warmup shape beats LR value

A well-shaped warmup + cosine decay schedule is more robust than a precisely-tuned static learning rate. The schedule is the real knob.

06 — RL is the new frontier

Reinforcement learning during post-training — GRPO and variants — is where the biggest capability gains are happening right now.

Chronology

The Modern Training Pipeline

From raw data to production deployment — step by step

Step 1

Collect & Clean Data

Data Engineering

Scrape the web, books, code, and papers. Filter garbage. Deduplicate aggressively. Data quality matters more than almost anything else.

What this involves
  • Web scraping and Common Crawl processing
  • Language detection and quality filtering
  • Deduplication using MinHash or exact matching
  • Toxic content filtering and removal
  • Custom data mixtures tuned via ablations
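The deduplication step above boils down to comparing documents by their overlapping word n-grams ("shingles"). Real pipelines use MinHash signatures and LSH banding to do this at web scale; this stripped-down sketch shows only the underlying similarity measure:

```python
# Near-duplicate detection via Jaccard similarity of word trigrams —
# the quantity that MinHash approximates cheaply at scale.
def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i+n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

doc1 = "the model reads trillions of words and learns from prediction alone"
doc2 = "the model reads trillions of words and learns from prediction only"
doc3 = "gradient descent follows the slope one step at a time"

near_dup = jaccard(doc1, doc2)   # high — one copy should be dropped
unrelated = jaccard(doc1, doc3)  # zero — both documents survive
```

A pipeline would drop whichever of the near-duplicates is lower quality; exact-match deduplication catches only literal copies, which is why shingle-based methods are used alongside it.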
Step 2

Pre-train on a GPU Cluster

Pre-training · Weeks to Months

Predict the next word, billions of times, across thousands of GPUs. The model reads everything and learns language, reasoning, and world knowledge from prediction alone.

Training setup
  • AdamW optimizer with β1=0.9, β2=0.95
  • BF16 mixed precision throughout
  • LR warmup (1–5%) → stable → cosine decay
  • Gradient clipping to prevent explosion
  • Distributed across 1,000–10,000 H100s
Step 3

Supervised Fine-Tuning (SFT)

Fine-tuning · Days

Train on thousands of high-quality (prompt, ideal response) pairs. The model learns to follow instructions. Usually uses LoRA to save compute.

SFT details
  • Human-written (instruction, response) pairs
  • Diverse tasks: summarization, coding, reasoning
  • LoRA adapters trained at rank 16–64
  • Lower learning rate than pre-training
  • 1–3 epochs maximum to avoid overfitting
Step 4

Alignment via DPO or RLHF

Alignment · Days

Collect human preference data. Apply DPO to make the model lean toward genuinely helpful, honest, safe answers. Multiple rounds targeting different failure modes.

Alignment pipeline
  • Human annotators rank response pairs
  • DPO loss applied on preference data directly
  • Multiple rounds targeting different failure modes
  • Safety evaluations between each round
  • Some labs use GRPO/PPO for reasoning models
Step 5

Evaluate & Red-Team

Evaluation · Ongoing

Run through thousands of benchmarks and adversarial prompts. Find failure modes. Fix with targeted fine-tuning. Repeat until ready.

Evaluation suite
  • MMLU, HellaSwag, TruthfulQA, HumanEval
  • Adversarial red-teaming for safety failures
  • Human preference A/B tests
  • Domain evals: medical, legal, code
  • Regression testing vs previous checkpoints
Step 6

Quantize & Deploy

Inference · Production

Compress weights to 4-bit or 8-bit for fast, cheap inference. The model millions of users interact with is almost always quantized — often to INT8 or NVFP4 on modern hardware.

Deployment stack
  • INT8 / INT4 quantization for inference speed
  • vLLM or SGLang for efficient batched serving
  • KV cache management for long contexts
  • Speculative decoding for latency reduction
  • Served via REST API at scale
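The quantization mentioned in the stack above can be sketched in its simplest symmetric form: store small integers plus one scale factor per row, and multiply back at inference time. Real serving stacks use more sophisticated per-group schemes, but the core idea is this:

```python
# Minimal symmetric INT8 quantization of one weight row.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127   # map the largest value to ±127
    q = [round(w / scale) for w in weights]      # store these small integers
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]              # recover approximate floats

w = [0.42, -1.27, 0.05, 0.88, -0.33]
q, scale = quantize_int8(w)
w_back = dequantize(q, scale)
# Each value comes back within half a quantization step of the original
```

Four times less memory than FP32 for a bounded, predictable error per weight — that trade is why the model users actually talk to is almost always quantized.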
Glossary

Every term in one place.

Parameters / Weights
The billions of numbers that define what the model knows. Learned, not hand-coded.
Token
A word or word-chunk the model processes at once. "Hello world" = 2 tokens.
Loss
A number measuring how wrong the model was. All training = making this go down.
Gradient
Direction to nudge each weight to reduce loss. Calculated by backpropagation.
Epoch
One full pass through the training data. Most LLM pre-training is under 1 epoch.
Batch
Group of examples processed together before one weight update. Larger = more stable.
Overfitting
Memorizing training data instead of learning patterns. Fails on new inputs.
Checkpoint
Saved snapshot of model weights at a specific training step.
Inference
Running the trained model to generate outputs. Not training — this is production.
Quantization
Compressing weights to fewer bits (INT4/INT8) for faster, cheaper inference.
Context Window
Maximum tokens the model can see at once. Claude 3: 200K. GPT-4: 128K.
SFT
Supervised Fine-Tuning — training on human-written (prompt, answer) pairs post pre-training.
VRAM
Dedicated GPU memory. Determines the maximum model size you can load and train.
PEFT
Parameter Efficient Fine-Tuning. Umbrella term covering LoRA and similar methods.