A 70B model needs 140 GB at float16. Your GPU has 24 GB. Quantization is the art of making models smaller without making them dumber — from the basics of bits and buckets to GPTQ, AWQ, NF4 and beyond.
Every parameter in a neural network is a number. In float32, each number costs 4 bytes. In float16, 2 bytes. A 70B parameter model at float16 needs 140 GB of VRAM just to store the weights — before activations, KV cache, or gradients. The most powerful consumer GPU (RTX 4090) has 24 GB. Even an A100 has only 80 GB.
Quantization is a controlled approximation: represent each weight using fewer bits, accepting a small error in exchange for dramatic reductions in memory and compute. Done carefully, a 4-bit quantized 70B model fits in 35–40 GB, runs on two consumer GPUs, and scores within 1–2% of the original on most benchmarks.
GPU reference: RTX 4090 = 24 GB · A100 = 80 GB · H100 = 80 GB · 2×A100 = 160 GB
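The arithmetic behind these numbers is a one-liner — a quick sketch (weights only, decimal GB):

```python
# Weight-storage footprint: params * bits / 8 bytes. Ignores activations,
# KV cache, and framework overhead, which all add on top.
PARAMS = 70e9  # 70B parameters

def weight_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9  # decimal GB

fp32 = weight_gb(PARAMS, 32)  # 280.0 GB
fp16 = weight_gb(PARAMS, 16)  # 140.0 GB
int4 = weight_gb(PARAMS, 4)   #  35.0 GB
```

The 4-bit figure explains the 35–40 GB range quoted above: the few extra GB on top of the raw 35 are per-group scales and other quantization metadata.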
Quantization-Aware Training (QAT). Simulate quantization noise during training so the model learns to be robust to it. Best accuracy — highest cost.
Post-Training Quantization (PTQ). Quantize an already-trained model. No retraining. Uses a small calibration dataset to find optimal quantization parameters. The practical standard.
Weight-Only Quantization. Quantize weights to 4-bit or 8-bit but keep activations in float16. Best accuracy-speed tradeoff for LLM inference. GPTQ, AWQ, NF4 all do this.
Weight + Activation Quantization. Quantize both weights and activations to int8. Enables integer matrix multiply — the fastest path on modern hardware. SmoothQuant, LLM.int8().
Mixed Precision. Different layers at different precisions. Sensitive layers (first/last, attention outputs) stay in float16. Less sensitive MLP layers go to 4-bit. Best quality per byte.
In QAT, you simulate quantization noise during the forward pass while training. The model sees rounded, clipped values during each step and learns to produce robust weights that tolerate the precision loss. The backward pass still uses full precision gradients — the quantization is a fake forward operation.
When to use QAT: When you own the training pipeline and need maximum accuracy at the target precision. Apple uses QAT for all on-device CoreML models. The cost is real — you're adding quantization simulation overhead to every training step. For a 70B model, this is only feasible for very large compute budgets.
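The "fake forward operation" can be sketched in a few lines — a simplified symmetric fake-quant op (the helper name is mine; real QAT frameworks insert an op like this into the graph and back-propagate through it with the straight-through estimator):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Simulated quantization used in the QAT forward pass: snap weights to
    the integer grid, then dequantize back to float. The backward pass treats
    this op as the identity (straight-through estimator), so gradients stay
    full precision."""
    qmax = 2 ** (bits - 1) - 1               # 127 for symmetric 8-bit
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.array([0.5, -1.27, 0.013])
w_q = fake_quant(w)                          # values now sit on the 8-bit grid
```

The model trains against `w_q`, so it learns weights whose rounded versions still work — that is the entire trick.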
PTQ quantizes an already-trained model using a small calibration dataset — typically 128 to 512 samples — to measure the statistical distribution of weights and activations. No gradient computation. No retraining. You can quantize any model you can load, including models you downloaded from HuggingFace.
The calibration dataset matters more than people realize. Bad calibration data (e.g., using random text when the model is a code specialist) can introduce significant accuracy loss even with advanced quantization methods. Always use domain-representative samples.
For most production LLM deployments, PTQ + careful calibration is the pragmatic choice. QAT shines for small edge models where accuracy is non-negotiable.
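A minimal PTQ calibration pass looks like this — the data here is random stand-in activations, purely for illustration; in practice you would capture real layer inputs while running domain-representative prompts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in calibration set: 8 batches of fake "activations" (hypothetical
# data; real calibration records layer inputs from representative prompts).
calib_batches = [rng.normal(size=(32, 512)) for _ in range(8)]

# Calibration = measure the observed range, derive one fixed INT8 scale.
absmax = max(np.abs(batch).max() for batch in calib_batches)
scale = absmax / 127.0

def quantize_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric INT8 using the calibrated scale — no gradients anywhere."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

q = quantize_int8(calib_batches[0])
roundtrip_err = np.abs(q * scale - calib_batches[0]).max()  # at most scale/2
```

If the calibration batches don't cover the real input distribution, `scale` is wrong for deployment traffic — which is exactly why domain-representative samples matter.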
In vision models, weight and activation distributions are relatively smooth. In LLMs, a small number of "outlier" values — typically ~0.1% of dimensions in large models — have magnitudes 10–100× larger than the rest. Simple uniform quantization assigns most of its precision budget to representing these outliers, leaving the remaining 99.9% badly approximated. Outliers are the root problem that every advanced quantization method in this post tackles in its own way.
A 32-bit float uses 1 sign bit, 8 exponent bits, and 23 mantissa bits — representing magnitudes up to ±3.4×10³⁸ with ~7 decimal digits of precision. Quantization maps this into a much smaller representation. Understanding the two basic schemes — symmetric (absmax) and asymmetric (zero-point) — is essential before diving into GPTQ or AWQ.
Each colored cell = 1 bit. Gray cells = bits saved vs FP32. Lower precision = more gray = less memory but more rounding error.
A single scale for an entire weight matrix is too coarse — different rows have wildly different ranges. Groupwise quantization divides each row into groups of g weights (typically g=128) and computes a separate scale per group. This dramatically reduces quantization error at the cost of slightly more metadata storage.
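A sketch of groupwise absmax quantization for one weight row (g=128, symmetric 4-bit; production kernels also pack two 4-bit codes per byte, which is skipped here):

```python
import numpy as np

def quantize_groupwise(row: np.ndarray, g: int = 128, bits: int = 4):
    """Groupwise absmax quantization of one weight row: each group of g
    weights gets its own scale, so an outlier only degrades its own group
    instead of the whole row."""
    qmax = 2 ** (bits - 1) - 1                         # 7 for INT4
    groups = row.reshape(-1, g)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales                               # 4-bit codes + FP scales

rng = np.random.default_rng(0)
row = rng.normal(scale=0.02, size=1024)
row[5] = 1.0                                           # a single outlier weight
codes, scales = quantize_groupwise(row)
deq = (codes * scales).reshape(-1)                     # dequantized row
```

With one scale for the whole row, the outlier at index 5 would blow up the error everywhere; here it only inflates the first group's scale.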
GPTQ is the most widely deployed quantization method for large language models. The key insight: don't quantize weights independently. Use second-order information (the Hessian) to compensate for each weight's quantization error by adjusting the remaining unquantized weights.
It builds on a line of work stretching from Optimal Brain Damage and Optimal Brain Surgeon (OBS) in the early 1990s to Optimal Brain Compression (OBC, 2022). The GPTQ paper scaled this approach to 175B models and made it run in hours rather than weeks.
Green = already quantized (compensated). Orange = currently quantizing. Gray = still in FP16. The Hessian compensation adjusts remaining columns after each quantization step.
The naïve Hessian update is O(n³) per layer — catastrophically slow for large weight matrices. GPTQ introduces two engineering tricks that make it practical:
Instead of updating the Hessian inverse after every weight, GPTQ batches 128 columns together, applies the same compensation formula, then recomputes. This amortizes the matrix inversion cost and maps efficiently to GPU tensor cores. Result: quantizing 175B parameters takes ~4 hours on a single A100 instead of weeks.
The raw Hessian inverse is numerically unstable — small calibration datasets and floating-point errors accumulate. GPTQ uses Cholesky decomposition (H = LLᵀ) to compute the inverse in a numerically stable form. Without this, quantization error grows catastrophically in later layers. With it, INT4 GPTQ is within 0.5–1 perplexity point of FP16 on most models.
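Stripped of blocking and the Cholesky machinery, the core per-row loop can be sketched as follows (a toy version on a tiny layer; the function name is mine and the uniform `grid` stands in for a real quantization grid):

```python
import numpy as np

def gptq_row(w, Hinv, grid):
    """Toy GPTQ on one weight row: quantize columns left to right, folding
    each column's rounding error into the not-yet-quantized columns via the
    inverse Hessian. Real GPTQ adds 128-column blocking and a Cholesky-form
    H^-1 for numerical stability."""
    w = w.astype(float).copy()
    q = np.empty_like(w)
    for j in range(len(w)):
        q[j] = grid[np.abs(grid - w[j]).argmin()]      # nearest grid level
        err = (w[j] - q[j]) / Hinv[j, j]
        w[j + 1:] -= err * Hinv[j, j + 1:]             # compensate the rest
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                          # calibration activations
H = X.T @ X / len(X)                                   # layer-wise Hessian proxy
Hinv = np.linalg.inv(H + 0.01 * np.eye(8))             # damped, as in practice
grid = np.linspace(-1, 1, 16)                          # uniform "4-bit" grid
w = rng.normal(scale=0.3, size=8)
q = gptq_row(w, Hinv, grid)
```

Each iteration keeps the layer output X·w as faithful as possible given the rounding already committed — that is what "compensate with the remaining weights" means concretely.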
GPTQ is weight-only — weights are stored in INT4, but dequantized back to FP16 for the actual matrix multiply. You save memory (4-bit storage) but don't get the throughput of INT4 compute. On GPUs without INT4 tensor cores, this means GPTQ is primarily a memory optimization, not a speed optimization.
AWQ takes a different perspective on the outlier problem. Rather than compensating after quantization (like GPTQ), it rescales weights before quantization so that the most important weights use more of the quantization range. The key observation: not all weights are equally important — the ones with large corresponding input activations matter disproportionately.
The AWQ Insight: Only ~1% of weights are "salient" — they correspond to large activation magnitudes and contribute disproportionately to output quality. Protecting just these 1% in higher precision preserves performance remarkably well. But instead of keeping them in FP16 (wasteful), AWQ scales them to use the full INT4 range and scales everything else down proportionally. Zero extra memory. Better accuracy than GPTQ on many tasks.
Left: raw weights — most cluster near zero, outliers waste quantization range. Right: after AWQ scaling — important channels expanded to fill the INT4 range uniformly.
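A toy version of AWQ's scale search (per-tensor quantization and a coarse α grid — both simplifications of the real per-group method; the helper name is mine):

```python
import numpy as np

def awq_scale_search(W, X, bits=4, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Toy AWQ: pick a per-input-channel scale s = mean|X|^alpha by grid
    search. Quantizing W*diag(s) (with X implicitly divided by s) gives the
    salient channels more of the INT4 range; the rescaling is exact before
    rounding, so only the quantization error changes."""
    qmax = 2 ** (bits - 1) - 1
    act = np.abs(X).mean(axis=0)                    # per-channel importance

    def quant(M):
        scale = np.abs(M).max() / qmax              # per-tensor, for brevity
        return np.clip(np.round(M / scale), -qmax - 1, qmax) * scale

    best = None
    for a in alphas:
        s = act ** a + 1e-8
        Wq = quant(W * s) / s                       # fold the scales back
        err = np.linalg.norm(X @ Wq.T - X @ W.T)    # output reconstruction error
        if best is None or err < best[0]:
            best = (err, a, Wq)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16)) * (1 + 9 * (rng.random(16) < 0.1))  # hot channels
W = rng.normal(scale=0.1, size=(8, 16))
err, alpha, Wq = awq_scale_search(W, X)
```

Note that α=0 (no scaling) is in the search grid, so the result can never be worse than plain rounding on this objective — AWQ's grid search has the same property.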
| Dimension | GPTQ | AWQ |
|---|---|---|
| Core idea | Compensate after rounding | Scale before rounding |
| Second-order info | ✅ Hessian | ❌ First-order only |
| Calibration cost | Higher (Hessian inversion) | Lower (grid search) |
| Speed (inference) | Similar | Similar |
| Accuracy at INT4 | Slightly better on perplexity | Better on instruction following |
| Multi-modal models | Requires care | Better — handles vision tokens well |
| Recommended for | Pure language models | Instruction-tuned, multi-modal |
| Popular implementations | AutoGPTQ | AutoAWQ |
Tim Dettmers' bitsandbytes library and the LLM.int8() paper solved a problem that had blocked INT8 quantization of large models: the outlier problem. Standard INT8 quantization of LLMs above ~6B parameters suffers catastrophic accuracy loss because ~0.1% of activation dimensions have values 100× larger than the rest, dominating the quantization range.
The solution is an elegant mixed-precision matrix multiplication: decompose the weight matrix into two parts based on whether the corresponding activation dimension is an outlier.
Orange columns = outlier dimensions (FP16). Teal columns = normal dimensions (INT8). LLM.int8() routes each column to the right compute path automatically.
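The mixed-precision decomposition can be sketched as follows (per-dimension absmax thresholding as a simplification of the paper's outlier criterion; 6.0 is the library's default magnitude threshold):

```python
import numpy as np

def int8_matmul_decomposed(X, W, threshold=6.0):
    """LLM.int8()-style sketch: hidden dimensions whose activation magnitude
    exceeds the threshold stay in full precision; the rest are quantized with
    symmetric absmax INT8 and multiplied in integers with INT32 accumulation,
    mirroring the paper's mixed-precision decomposition."""
    outlier = np.abs(X).max(axis=0) > threshold      # per hidden dimension
    y_fp = X[:, outlier] @ W[outlier, :]             # FP path for outliers
    Xs, Ws = X[:, ~outlier], W[~outlier, :]
    sx = np.abs(Xs).max() / 127
    sw = np.abs(Ws).max() / 127
    Xi = np.round(Xs / sx).astype(np.int32)          # int codes (INT32 to
    Wi = np.round(Ws / sw).astype(np.int32)          # model the accumulator)
    return y_fp + (Xi @ Wi) * (sx * sw)              # rescale, merge the paths

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))
X[:, 3] *= 50.0                                      # one outlier dimension
W = rng.normal(scale=0.1, size=(64, 16))
y = int8_matmul_decomposed(X, W)                     # close to the FP result
```

Without the decomposition, dimension 3 would force the INT8 scale to ~0.4 and crush every other dimension into a handful of integer levels.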
bitsandbytes implements two INT8 strategies: absmax (symmetric, scale = max(|X|)/127, zero-point = 0) and zero-point (asymmetric, maps the full float range to [−128, 127] using both scale and offset). Absmax is simpler and faster. Zero-point is more accurate for asymmetric distributions (e.g., ReLU outputs that are always non-negative). LLM.int8() uses absmax for the INT8 path and FP16 for the outlier path.
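The two schemes in code (a minimal sketch — bitsandbytes implements them as fused CUDA kernels, but the math is this):

```python
import numpy as np

def absmax_quant(x):
    """Symmetric: scale maps [-max|x|, max|x|] onto [-127, 127]; zero-point
    is 0, so dequantization is just q * scale."""
    scale = np.abs(x).max() / 127
    return np.round(x / scale).astype(np.int8), scale

def zeropoint_quant(x):
    """Asymmetric: map [min, max] onto [-128, 127] using a scale plus an
    offset; dequantization is (q - zp) * scale."""
    scale = (x.max() - x.min()) / 255
    zp = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale) + zp, -128, 127).astype(np.int8)
    return q, scale, zp

relu_out = np.array([0.0, 0.1, 0.2, 3.0])   # always non-negative, like ReLU
qa, sa = absmax_quant(relu_out)             # wastes all codes below zero
qz, sz, zp = zeropoint_quant(relu_out)      # uses the full [-128, 127] range
```

On this ReLU-style input, absmax leaves half the integer range unused, while zero-point spends every code on the region where values actually live — the accuracy difference the text describes.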
The practical result: LLM.int8() reduces memory by ~50% vs FP16 with essentially zero accuracy loss on models above 6.7B parameters (smaller models have fewer outliers and don't need the decomposition). It's the safest quantization method — but INT8 weights still require dequantization before compute on most GPUs, so the throughput benefit is smaller than you'd expect.
NF4 is the quantization format behind QLoRA — the method that made fine-tuning 65B models on a single GPU possible. It's not just a smaller integer — it's a carefully designed 4-bit datatype that is information-theoretically optimal for normally distributed data.
The key insight: neural network weights, after training, follow an approximately normal (Gaussian) distribution centered at zero. Standard INT4 maps uniformly-spaced integer values to the weight range — which wastes precision because the uniform grid doesn't match the non-uniform density of the normal distribution. NF4 fixes this with quantile quantization.
INT4 places levels uniformly across the range — half its precision goes to values rarely seen. NF4 concentrates levels near zero where weights cluster — minimizing average quantization error for Gaussian weight distributions.
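The shape of the NF4 grid can be reproduced from first principles with a simplified quantile construction (the real NF4 table is built slightly differently so that one code is exactly zero):

```python
import numpy as np
from statistics import NormalDist

def nf4_like_levels(bits: int = 4) -> np.ndarray:
    """Quantile-based levels for normally distributed data: place 2^bits
    levels at evenly spaced quantiles of N(0, 1), then normalize into
    [-1, 1]. A simplified NF4-style construction, not the exact table."""
    k = 2 ** bits
    nd = NormalDist()
    probs = np.linspace(0.5 / k, 1 - 0.5 / k, k)     # avoid infinite tails
    levels = np.array([nd.inv_cdf(p) for p in probs])
    return levels / np.abs(levels).max()

levels = nf4_like_levels()
gaps = np.diff(levels)   # spacing: dense near zero, sparse in the tails
```

Printing `gaps` shows the point of the figure: the central levels sit ~3× closer together than the outermost ones, matching where Gaussian weights actually concentrate.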
QLoRA introduces a second innovation on top of NF4: double quantization. Each group of 64 weights shares a quantization constant (scale) stored as a 32-bit float — that's 0.5 bits of overhead per weight. Double quantization quantizes these scale constants themselves using 8-bit floats (in second-level groups of 256), reducing the overhead to ~0.127 bits per weight. For a 65B model, this saves about 3 GB.
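The bookkeeping is easy to verify in bits per weight (second-level block size 256, as in the QLoRA paper):

```python
# Metadata overhead in bits per weight, QLoRA configuration:
#   first level:  one FP32 scale per 64-weight group
#   second level: those scales re-quantized to 8-bit floats in groups of 256,
#                 each second-level group keeping one FP32 scale of its own
plain = 32 / 64                       # 0.5 bits/weight
double = 8 / 64 + 32 / (64 * 256)     # ~0.127 bits/weight

# Savings for a 65B-parameter model, in decimal GB
saved_gb = 65e9 * (plain - double) / 8 / 1e9  # ~3 GB
```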
Trained model weights follow a Gaussian distribution because SGD with L2 regularization (weight decay) actively pushes them toward zero. NF4's quantile-based levels perfectly match this. But activations don't follow a Gaussian — after ReLU they're clipped positive, after attention softmax they're in [0,1], after LayerNorm they can be any shape. NF4 is specifically designed for weight-only quantization; INT8/FP8 remains the right choice for activations.
SmoothQuant (Xiao et al., 2022) solves a different problem: activations are harder to quantize than weights because their distribution changes per-input (you can't run calibration on every new prompt). The insight: shift the quantization difficulty from activations to weights, which are static and easier to calibrate.
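The migration is exact before quantization — a sketch using the paper's per-channel formula s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5 (the helper name is mine):

```python
import numpy as np

def smoothquant_fold(W, X, alpha=0.5):
    """Per-channel smoothing: s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    Dividing X by s and multiplying W by s leaves X @ W.T mathematically
    unchanged, but flattens activation outliers so INT8 activation
    quantization becomes feasible — at a mild cost to the weight range."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=0) ** (1 - alpha)
    return X / s, W * s

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
X[:, 2] *= 100.0                 # one hard-to-quantize activation channel
W = rng.normal(scale=0.1, size=(4, 8))
Xs, Ws = smoothquant_fold(W, X)  # identical product, tamer activation range
```

In deployment, the division by s is folded into the previous layer's weights (e.g. the LayerNorm affine), so there is no runtime cost at all.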
NVIDIA H100 GPUs introduced native FP8 tensor cores. FP8 is a floating-point format (not integer) with two variants: E4M3 (4 exponent bits, 3 mantissa, range ±448) and E5M2 (5 exponent bits, 2 mantissa, range ±57344). Unlike weight-only INT4/INT8 schemes, which dequantize back to FP16 before the matrix multiply, FP8 runs natively on the tensor cores — meaning actual speedup, not just memory savings.
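Both maxima follow directly from the bit layouts (E4M3 reserves its all-ones code for NaN, capping the largest mantissa at 1.75; E5M2's two mantissa bits cap it at 1.75 anyway):

```python
# Largest finite values of the two FP8 variants, derived from the bit layout.
# E4M3 (OCP "fn" variant): max exponent 8, all-ones code is NaN, so the
#   largest mantissa is 1.75 -> 1.75 * 2^8 = 448.
# E5M2 (IEEE-style): max exponent 15, two mantissa bits -> 1.75 * 2^15 = 57344.
e4m3_max = (2 - 2 ** -2) * 2 ** 8   # 448.0
e5m2_max = (2 - 2 ** -2) * 2 ** 15  # 57344.0
```

The asymmetry is deliberate: weights and activations need mantissa precision (E4M3), while gradients need dynamic range (E5M2).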
Meta released Llama 3.1 405B with an official FP8-quantized variant for inference, and frontier labs increasingly use FP8 in parts of training, with E4M3 for weights and activations and E5M2 for gradients emerging as the standard split. FP8 inference on H100 achieves up to 2× throughput vs FP16 with near-identical accuracy — the next generation of fast inference will be FP8, not INT4.
GGUF (GPT-Generated Unified Format) is llama.cpp's model format — it bundles weights, tokenizer, and quantization metadata into a single file. It supports a range of quantization types labeled Q2_K through Q8_0, where the number is the nominal bit-width and the suffix identifies the scheme (the k-quant variant and its size/quality tier).
| GGUF Type | Bits/Weight | 70B Model Size | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~24 GB | ⚠️ Noticeable loss | Ultra-low memory only |
| Q4_K_S | ~4.4 bits | ~38 GB | ✅ Good | Standard 4-bit |
| Q4_K_M | ~4.8 bits | ~42 GB | ✅ Very good | Recommended default |
| Q5_K_M | ~5.7 bits | ~50 GB | ✅✅ Excellent | Quality-conscious deployment |
| Q6_K | ~6.6 bits | ~57 GB | ✅✅ Near FP16 | Maximum quality INT |
| Q8_0 | 8 bits | ~70 GB | ✅✅✅ FP16 quality | Reference / highest quality |
K-quants in GGUF: The _K suffix means "k-quants" — a mixed-precision scheme where important layers (embeddings, attention output, first/last transformer layers) are quantized at higher precision than less important layers. Q4_K_M uses Q6_K for ~10% of layers and Q4_K for the rest. This mixed approach is why K-quants consistently outperform non-K variants at the same average bit-width.
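The size column is just parameters × average bits per weight — the fractional bit-widths already fold in the per-group metadata:

```python
def gguf_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Estimated file size: parameters * average bits per weight. The
    fractional bit-widths in the table above already include the per-group
    scales and mins that k-quants store alongside the codes."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

q4_k_m = gguf_size_gb(70, 4.8)  # ~42 GB, matching the table
q8_0 = gguf_size_gb(70, 8.0)    # 70 GB
```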
Bubble area ∝ inference throughput improvement vs FP16. Top-right = ideal (high accuracy, high compression). No method reaches both extremes.
| Method | Precision | Memory vs FP16 | Accuracy Loss | Speed vs FP16 | Calibration | Best For |
|---|---|---|---|---|---|---|
| FP16 Baseline | 16-bit float | 1.0× | 0% | 1.0× | None | Max accuracy, ample VRAM |
| LLM.int8() | INT8 + FP16 | 2.0× | ~0% | ~1.0× (memory bound) | None | Safe drop-in, large models |
| SmoothQuant | INT8 W+A | 2.0× | <0.5% | 1.5–1.8× (INT8 matmul) | Small | High throughput inference |
| GPTQ | INT4 | 3.5–4.0× | 0.5–1.5% | 1.5–2.0× | Yes (calibration set) | Language models, perplexity-sensitive |
| AWQ | INT4 | 3.5–4.0× | 0.5–1.5% | 1.5–2.0× | Yes (smaller) | Instruction-tuned, multi-modal |
| NF4 (QLoRA) | NF4 4-bit | 4.0×+ | ~1% | 1.5× (dequant on-the-fly) | None | Training + inference on consumer GPU |
| GGUF Q4_K_M | ~4.8-bit mixed | 3.3× | <1% | 1.5–2.5× (CPU + GPU) | None (pre-computed) | Local deployment, llama.cpp |
| FP8 (H100) | 8-bit float | 2.0× | ~0% | 2.0× (hardware native) | Small | Production serving on H100 |
GGUF Q4_K_M via llama.cpp for inference. NF4 via bitsandbytes if you're also fine-tuning. Both can run 70B models with 24 GB of VRAM by splitting layers across GPU and CPU.
GPTQ or AWQ INT4 for maximum model size per GPU. SmoothQuant INT8 if you need better accuracy. LLM.int8() as the safe no-calibration option.
FP8 on H100 is the state of the art — 2× throughput, near-zero accuracy cost. SmoothQuant INT8 on A100/A10 if FP8 hardware isn't available.
QAT with INT8 (CoreML / ONNX) for maximum accuracy. GGUF Q4_K_S for llama.cpp on phone. Target 1–3B models — quantization helps but model size is the real constraint.
QLoRA with NF4 — the only practical way to fine-tune 13B+ models on consumer hardware. Base model in NF4, LoRA adapters in BF16. bitsandbytes handles everything.
My take: The quantization landscape is converging on two tiers. For production at scale, FP8 on H100/H200 will dominate within 2 years — it's hardware-native, lossless, and fast. For local deployment and fine-tuning, 4-bit methods (AWQ, GGUF Q4_K_M, NF4) have hit a quality plateau where further research yields diminishing returns. The interesting frontier is pushing to 2-bit quantization without catastrophic loss — QuIP#, AQLM, and BitNet 1.58b are early results suggesting it's possible, but not yet practical at scale.