Deep Dive · Model Compression

Quantization
in LLMs

A 70B model needs 140 GB at float16. Your GPU has 24 GB. Quantization is the art of making models smaller without making them dumber — from the basics of bits and buckets to GPTQ, AWQ, NF4 and beyond.

March 2026 · Prateek Singh, PhD
Quantization · GPTQ · AWQ · NF4 · bitsandbytes

The memory wall that
every large model hits

Every parameter in a neural network is a number. In float32, each number costs 4 bytes. In float16, 2 bytes. A 70B parameter model at float16 needs 140 GB of VRAM just to store the weights — before activations, KV cache, or gradients. The most powerful consumer GPU (RTX 4090) has 24 GB. Even an A100 has only 80 GB.

Quantization is a controlled approximation: represent each weight using fewer bits, accepting a small error in exchange for dramatic reductions in memory and compute. Done carefully, a 4-bit quantized 70B model fits in 35–40 GB, runs on two consumer GPUs, and scores within 1–2% of the original on most benchmarks.
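The arithmetic is worth making concrete. A quick sketch (plain Python, decimal gigabytes) of the weight-only memory footprint at each precision:

```python
# Rough VRAM needed just to store the weights, at several precisions.
# 70e9 parameters matches the article's running example.
params = 70e9

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> decimal GB
    print(f"{name:>7}: {gb:6.1f} GB")
```

At 4 bits the 70B model lands at 35 GB before any metadata overhead, which is where the 35–40 GB figure above comes from.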

Memory Required vs Precision — interactive chart (model-size selector: 7B / 13B / 34B / 70B)

GPU reference: RTX 4090 = 24 GB · A100 = 80 GB · H100 = 80 GB · 2×A100 = 160 GB

📐

Pre-Training Quant (QAT)

Quantization-Aware Training. Simulate quantization noise during training so the model learns to be robust to it. Best accuracy — highest cost.

🔬

Post-Training Quant (PTQ)

Quantize an already-trained model. No retraining. Uses a small calibration dataset to find optimal quantization parameters. The practical standard.

Weight-Only Quant

Quantize weights to 4-bit or 8-bit but keep activations in float16. Best accuracy-speed tradeoff for LLM inference. GPTQ, AWQ, NF4 all do this.

🚀

Weight + Activation Quant

Quantize both weights and activations to int8. Enables integer matrix multiply — the fastest path on modern hardware. SmoothQuant, LLM.int8().

🎯

Mixed Precision

Different layers at different precisions. Sensitive layers (first/last, attention outputs) stay in float16. Less sensitive MLP layers go to 4-bit. Best quality per byte.

01
Pre vs Post Training Quantization
The fundamental fork in the road — when do you quantize?

In QAT, you simulate quantization noise during the forward pass while training. The model sees rounded, clipped values during each step and learns to produce robust weights that tolerate the precision loss. The backward pass still uses full precision gradients — the quantization is a fake forward operation.

QAT — Straight-Through Estimator
Forward pass (quantized):
w_q = round( clip(w, −2^(b−1), 2^(b−1)−1) ) ← discrete, non-differentiable

Backward pass (straight-through estimator):
∂L/∂w ≈ ∂L/∂w_q · 𝟙[|w| ≤ clip_val] ← pretend quant doesn't exist

The model optimizes weights that, when quantized, still minimize loss.
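A minimal numpy sketch of the fake-quant forward and STE backward. This is a toy with a fixed clip value, not a training framework's QAT module:

```python
import numpy as np

def fake_quant(w, bits=4, clip_val=1.0):
    """QAT forward pass: simulate b-bit symmetric quantization.
    The model 'sees' these clipped, rounded values during training."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip_val / qmax
    return np.round(np.clip(w, -clip_val, clip_val) / scale) * scale

def ste_grad(w, upstream, clip_val=1.0):
    """Backward pass, straight-through estimator: treat round() as the
    identity, but zero the gradient where the weight was clipped."""
    return upstream * (np.abs(w) <= clip_val)

w = np.array([0.9, -0.3, 0.05, 1.8])
w_seen = fake_quant(w)                  # values the forward pass uses
g = ste_grad(w, np.ones_like(w))        # gradient at 1.8 is zeroed
```

In an autograd framework the same effect is usually expressed as `w + (quant(w) - w).detach()`, so the backward pass skips the rounding entirely.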

When to use QAT: When you own the training pipeline and need maximum accuracy at the target precision. Apple uses QAT for all on-device CoreML models. The cost is real — you're adding quantization simulation overhead to every training step. For a 70B model, this is only feasible for very large compute budgets.

PTQ quantizes an already-trained model using a small calibration dataset — typically 128 to 512 samples — to measure the statistical distribution of weights and activations. No gradient computation. No retraining. You can quantize any model you can load, including models you downloaded from HuggingFace.

PTQ Pipeline
Pretrained FP16 Model
128–512 calibration samples
Run forward pass, collect weight/activation stats
Compute scale + zero-point per layer
Round weights to target precision (INT8/INT4/NF4)
Quantized model ready

The calibration dataset matters more than people realize. Bad calibration data (e.g., using random text when the model is a code specialist) can introduce significant accuracy loss even with advanced quantization methods. Always use domain-representative samples.

QAT vs PTQ — Side-by-Side Comparison

| | QAT | PTQ |
|---|---|---|
| When quantization happens | During training | After training |
| Data required | Full training pipeline | 128–512 calibration samples |
| Cost | High — quant simulation on every step | Low — no gradients, no retraining |
| Accuracy at target precision | Best | Good, with careful calibration |
| Typical use | On-device / edge models | Large open-weight LLMs |

For most production LLM deployments, PTQ + careful calibration is the pragmatic choice. QAT shines for small edge models where accuracy is non-negotiable.

The Outlier Problem — Why LLMs Are Hard to Quantize

In vision models, weight and activation distributions are relatively smooth. In LLMs, a small number of "outlier" values — typically ~0.1% of dimensions in large models — have magnitudes 10–100× larger than the rest. Simple uniform quantization assigns most of its precision budget to representing these outliers, leaving the remaining 99.9% badly approximated. This is the root cause behind every advanced quantization method in this blog — they all tackle outliers differently.

02
Bits, Buckets & Error
How numbers get smaller — and what gets lost

A 32-bit float uses 1 sign bit, 8 exponent bits, and 23 mantissa bits — covering magnitudes up to ±3.4×10³⁸ with ~7 decimal digits of precision. Quantization maps this into a much smaller representation. Understanding the two types of quantization is essential before diving into GPTQ or AWQ.

Symmetric (Abs Max) Quantization
scale = max(|W|) / (2^(b−1) − 1) ← maps max weight to max int
W_q = round( W / scale ) ← INT values in [−2^(b-1), 2^(b-1)−1]
W_deq = W_q × scale ← dequantize for compute

Error = W − W_deq (quantization error)
For INT8: scale = max/127, ~0.8% average error
For INT4: scale = max/7, ~3.5% average error
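Here is the absmax recipe as runnable numpy. The error you measure depends on the weight distribution, so the printout is illustrative rather than canonical:

```python
import numpy as np

def absmax_quant(w, bits=8):
    """Symmetric (absmax) quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1           # 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax
    w_q = np.round(w / scale).astype(np.int32)
    return w_q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                # toy Gaussian "weights"

for bits in (8, 4):
    w_q, scale = absmax_quant(w, bits)
    w_deq = w_q * scale                  # dequantize for compute
    rel = np.abs(w - w_deq).mean() / np.abs(w).mean()
    print(f"INT{bits}: mean relative error {rel:.2%}")
```

The guarantee worth remembering: every element's reconstruction error is at most half the scale, which is why a finer (per-group) scale directly buys accuracy.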
Bit Precision — Value Representation

Each colored cell = 1 bit. Gray cells = bits saved vs FP32. Lower precision = more gray = less memory but more rounding error.

A single scale for an entire weight matrix is too coarse — different rows have wildly different ranges. Groupwise quantization divides each row into groups of g weights (typically g=128) and computes a separate scale per group. This dramatically reduces quantization error at the cost of slightly more metadata storage.

Groupwise Quantization — Group Size 128
For weight matrix W ∈ ℝ^(d_out × d_in), group_size = 128:
n_groups = d_in / 128

for each group g:
  scale_g = max(|W[:, g*128:(g+1)*128]|) / (2^(b−1) − 1)
  W_q[:, g*128:(g+1)*128] = round(W[:, g*128:(g+1)*128] / scale_g)

Metadata overhead: (d_out × n_groups) extra fp16 scale values
For 4096×4096 matrix, g=128: 4096×32 = 131K extra fp16 values = 256 KB
vs compressed weights: 4096×4096×0.5 bytes = 8 MB — overhead is 3%
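The same idea in numpy — the shapes and group bookkeeping are the whole trick (the matrix size here is arbitrary, and a real kernel would use a scale per row per group exactly as below):

```python
import numpy as np

def groupwise_quant(W, bits=4, group_size=128):
    """One scale per contiguous group of `group_size` input channels."""
    qmax = 2 ** (bits - 1) - 1
    d_out, d_in = W.shape
    G = W.reshape(d_out, d_in // group_size, group_size)
    scales = np.abs(G).max(axis=-1, keepdims=True) / qmax
    W_q = np.round(G / scales).reshape(d_out, d_in)
    return W_q, scales.squeeze(-1)       # scales: (d_out, n_groups)

def groupwise_dequant(W_q, scales, group_size=128):
    d_out, d_in = W_q.shape
    G = W_q.reshape(d_out, -1, group_size) * scales[..., None]
    return G.reshape(d_out, d_in)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
W_q, scales = groupwise_quant(W)
W_deq = groupwise_dequant(W_q, scales)
```

Each element's error is bounded by half its own group's scale, which is what makes groupwise quantization so much tighter than a single per-tensor scale.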
03
GPTQ
One-Shot Weight Quantization from the Hessian — 2022
Frantar et al. 2022 Weight-only INT4 / INT3 Hessian-based

GPTQ is the most widely deployed quantization method for large language models. The key insight: don't quantize weights independently. Use second-order information (the Hessian) to compensate for each weight's quantization error by adjusting the remaining unquantized weights.

It builds on Optimal Brain Surgeon (OBS), a Hessian-based pruning method from the 1990s, by way of the more recent Optimal Brain Compression (OBC). The GPTQ paper scaled the approach to 175B models and made it run in hours rather than weeks.

GPTQ — Layer-Wise Quantization with Hessian Compensation
Goal: minimize output error per layer independently
min ‖W·X − Q(W)·X‖²_F

Hessian (second-order) information:
H = 2·X·Xᵀ ← computed from calibration data, O(d²) cost

For each weight w_q in column order:
1. Quantize w_q: δ_q = w_q − quant(w_q) ← quantization error
2. Compensate remaining weights in same row:
   w_remaining −= δ_q · (H⁻¹)q,remaining / (H⁻¹)q,q ← inverse-Hessian update
3. Move to next column

This ensures quantizing w_q causes minimum damage to the next weights
GPTQ Column-Wise Quantization

Green = already quantized (compensated). Orange = currently quantizing. Gray = still in FP16. The Hessian compensation adjusts remaining columns after each quantization step.
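A toy version of the GPTQ inner loop makes the compensation concrete. This sketch cuts every corner the paper doesn't: a single per-tensor scale, a plain matrix inverse instead of the Cholesky trick, and no lazy batching:

```python
import numpy as np

def quant_rtn(w, scale, qmax):
    """Round-to-nearest onto the symmetric integer grid."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_layer(W, X, bits=4, damp=0.01):
    """Toy GPTQ sketch. W: (d_out, d_in), X: (d_in, n_samples)."""
    qmax = 2 ** (bits - 1) - 1
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    H = 2.0 * X @ X.T                          # layer Hessian from calibration
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)  # dampening for stability
    Hinv = np.linalg.inv(H)                    # real GPTQ: Cholesky, not inv
    scale = np.abs(W).max() / qmax
    Q = np.zeros_like(W)
    for q in range(d_in):                      # quantize columns in order
        Q[:, q] = quant_rtn(W[:, q], scale, qmax)
        err = (W[:, q] - Q[:, q]) / Hinv[q, q]
        # spread the error onto the not-yet-quantized columns
        W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
    return Q, scale
```

On calibration data resembling the layer's real inputs, this typically produces noticeably lower ‖W·X − Q·X‖ than independent round-to-nearest with the same scale.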

The naïve Hessian update is O(n³) per layer — catastrophically slow for large weight matrices. GPTQ introduces two engineering tricks that make it practical:

Lazy Batch Updates

Instead of updating the Hessian inverse after every weight, GPTQ batches 128 columns together, applies the same compensation formula, then recomputes. This amortizes the matrix inversion cost and maps efficiently to GPU tensor cores. Result: quantizing 175B parameters takes ~4 hours on a single A100 instead of weeks.

Cholesky Decomposition Numerical Stability

The raw Hessian inverse is numerically unstable — small calibration datasets and floating-point errors accumulate. GPTQ uses Cholesky decomposition (H = LLᵀ) to compute the inverse in a numerically stable form. Without this, quantization error grows catastrophically in later layers. With it, INT4 GPTQ is within 0.5–1 perplexity point of FP16 on most models.

GPTQ is weight-only — weights are stored in INT4, but dequantized back to FP16 for the actual matrix multiply. You save memory (4-bit storage) but don't get the throughput of INT4 compute. On GPUs without INT4 tensor cores, this means GPTQ is primarily a memory optimization, not a speed optimization.

04
AWQ
Activation-aware Weight Quantization — protect what matters
Lin et al. 2023 Weight-only INT4 / INT3 Activation-aware

AWQ takes a different perspective on the outlier problem. Rather than compensating after quantization (like GPTQ), it rescales weights before quantization so that the most important weights use more of the quantization range. The key observation: not all weights are equally important — the ones with large corresponding input activations matter disproportionately.

💡

The AWQ Insight: Only ~1% of weights are "salient" — they correspond to large activation magnitudes and contribute disproportionately to output quality. Protecting just these 1% in higher precision preserves performance remarkably well. But instead of keeping them in FP16 (wasteful), AWQ scales them to use the full INT4 range and scales everything else down proportionally. Zero extra memory. Better accuracy than GPTQ on many tasks.

AWQ — Per-Channel Scaling
Saliency: weight importance ∝ activation magnitude
s_importance = mean(|X|) ← average activation per input channel

Optimal scale (found by grid search):
s* = argmin_s ‖Q(s·W) · (X/s) − W·X‖

Intuition: scale up important weights → they use full INT4 range
Scale s absorbed into previous LayerNorm — no inference overhead

s* ≈ s_importance^α where α ≈ 0.5 ← tuned per-model
AWQ — Weight Saliency & Scaling

Left: raw weights — most cluster near zero, outliers waste quantization range. Right: after AWQ scaling — important channels expanded to fill the INT4 range uniformly.
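In practice the AWQ search reduces to a tiny grid search over α. A hedged sketch — per-tensor quantization and made-up layer sizes, purely illustrative:

```python
import numpy as np

def quant_dequant(W, bits=4):
    """Simulate symmetric INT quantization with a per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale) * scale

def awq_search(W, X, alphas=np.linspace(0, 1, 11)):
    """Grid-search the AWQ exponent: s = mean|X| ** alpha per input
    channel. Scale weights up before quantization, activations down
    after, and keep whichever alpha minimizes the output error.
    W: (d_out, d_in), X: (d_in, n_samples)."""
    s_imp = np.abs(X).mean(axis=1)            # per-channel importance
    ref = W @ X                               # FP reference output
    best = (None, np.inf)
    for a in alphas:
        s = np.maximum(s_imp, 1e-8) ** a
        W_s = quant_dequant(W * s[None, :])   # quantize the scaled weights
        out = W_s @ (X / s[:, None])          # inverse scale on activations
        err = np.linalg.norm(out - ref)
        if err < best[1]:
            best = (a, err)
    return best
```

Because α = 0 (plain round-to-nearest) is in the grid, the search can only match or beat naive quantization on the calibration set.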

| Dimension | GPTQ | AWQ |
|---|---|---|
| Core idea | Compensate after rounding | Scale before rounding |
| Second-order info | ✅ Hessian | ❌ First-order only |
| Calibration cost | Higher (Hessian inversion) | Lower (grid search) |
| Speed (inference) | Similar | Similar |
| Accuracy at INT4 | Slightly better on perplexity | Better on instruction following |
| Multi-modal models | Requires care | Better — handles vision tokens well |
| Recommended for | Pure language models | Instruction-tuned, multi-modal |
| Popular implementations | AutoGPTQ | AutoAWQ |
05
bitsandbytes & LLM.int8()
The mixed decomposition that makes INT8 practical for LLMs
Dettmers et al. 2022 INT8 + FP16 No calibration Mixed decomposition

Tim Dettmers' bitsandbytes library and the LLM.int8() paper solved a problem that had blocked INT8 quantization of large models: the outlier problem. Standard INT8 quantization of LLMs above ~6B parameters suffers catastrophic accuracy loss because ~0.1% of activation dimensions have values 100× larger than the rest, dominating the quantization range.

The solution is an elegant mixed-precision matrix multiplication: decompose the weight matrix into two parts based on whether the corresponding activation dimension is an outlier.

LLM.int8() — Mixed Decomposition
Step 1: Identify outlier dimensions in activations X
O = {i : max|X[:,i]| > threshold} ← ~0.1% of dimensions, threshold≈6.0

Step 2: Split matrix multiply into two parts
Y = X·Wᵀ
 = X[:,O]·W[O,:]ᵀ ← outlier dims: keep in FP16
 + X[:,!O]·W[!O,:]ᵀ ← normal dims: quantize to INT8

~99.9% of the compute is INT8 · INT8 (fast)
~0.1% stays FP16 · FP16 (handles outliers safely)
LLM.int8() Mixed Decomposition — Live View

Orange columns = outlier dimensions (FP16). Teal columns = normal dimensions (INT8). LLM.int8() routes each column to the right compute path automatically.
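The decomposition itself is a few lines of numpy. This sketch uses per-tensor absmax scales where the real LLM.int8() uses vector-wise scales, but the routing logic is the same:

```python
import numpy as np

def int8_absmax(M):
    """Per-tensor absmax INT8 (LLM.int8() proper uses vector-wise scales)."""
    scale = np.abs(M).max() / 127.0
    return np.round(M / scale).astype(np.int32), scale

def llm_int8_matmul(X, W, threshold=6.0):
    """Mixed decomposition of Y = X @ W.T: outlier input dims stay in
    float, the rest go through INT8. X: (n, d_in), W: (d_out, d_in)."""
    outlier = np.abs(X).max(axis=0) > threshold
    Y = np.zeros((X.shape[0], W.shape[0]))
    if outlier.any():                       # ~0.1% of dims: float path
        Y += X[:, outlier] @ W[:, outlier].T
    if (~outlier).any():                    # ~99.9% of dims: INT8 path
        Xq, sx = int8_absmax(X[:, ~outlier])
        Wq, sw = int8_absmax(W[:, ~outlier])
        Y += (Xq @ Wq.T) * (sx * sw)        # integer matmul, then rescale
    return Y
```

The point of the split: without it, one outlier column would blow up the absmax scale and crush the precision available to every normal dimension.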

Zero-Point vs Absmax Quantization in bitsandbytes

bitsandbytes implements two INT8 strategies: absmax (symmetric, scale = max(|X|)/127, zero-point = 0) and zero-point (asymmetric, maps the full float range to [−128, 127] using both scale and offset). Absmax is simpler and faster. Zero-point is more accurate for asymmetric distributions (e.g., ReLU outputs that are always non-negative). LLM.int8() uses absmax for the INT8 path and FP16 for the outlier path.

The practical result: LLM.int8() reduces memory by ~50% vs FP16 with essentially zero accuracy loss on models above 6.7B parameters (smaller models have fewer outliers and don't need the decomposition). It's the safest quantization method — but INT8 weights still require dequantization before compute on most GPUs, so the throughput benefit is smaller than you'd expect.

06
NF4 — NormalFloat4
Information-theoretically optimal for normal distributions
Dettmers et al. 2023 4-bit Quantile quant. QLoRA backbone

NF4 is the quantization format behind QLoRA — the method that made fine-tuning 65B models on a single GPU possible. It's not just a smaller integer — it's a carefully designed 4-bit datatype that is information-theoretically optimal for normally distributed data.

The key insight: neural network weights, after training, follow an approximately normal (Gaussian) distribution centered at zero. Standard INT4 maps uniformly-spaced integer values to the weight range — which wastes precision because the uniform grid doesn't match the non-uniform density of the normal distribution. NF4 fixes this with quantile quantization.

NF4 — Quantile Quantization Construction
Concept: place quantization levels at quantiles of N(0,1)
q_i = Φ⁻¹( (i + 0.5) / 2^k ) for i = 0, 1, ..., 2^k − 1

where Φ−1 is the inverse CDF of the standard normal
Result: each bucket contains roughly equal probability mass

NF4 values (16 levels, symmetric around 0):
[-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7229, 1.0]

These are NOT uniformly spaced — more levels near 0 where weights cluster
NF4 vs INT4 — Quantization Level Placement

INT4 places levels uniformly across the range — half its precision goes to values rarely seen. NF4 concentrates levels near zero where weights cluster — minimizing average quantization error for Gaussian weight distributions.
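You can reproduce the flavor of the NF4 table with the standard library alone. Note this follows the article's simplified quantile formula; the shipped NF4 table uses an adjusted asymmetric construction that represents 0 exactly, so the values differ slightly:

```python
from statistics import NormalDist

def quantile_levels(bits=4):
    """Quantization levels at equal-probability quantiles of N(0,1),
    normalized to [-1, 1] (simplified, symmetric construction)."""
    n = 2 ** bits
    nd = NormalDist()
    q = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(v) for v in q)
    return [v / m for v in q]

def nf_quantize(ws, levels):
    """Absmax-normalize, then snap each weight to its nearest level."""
    scale = max(abs(w) for w in ws) or 1.0
    return [min(levels, key=lambda l: abs(l - w / scale)) * scale for w in ws]

levels = quantile_levels()
```

Printing `levels` shows the signature property: the spacing between adjacent levels is tightest around zero and widest at ±1, the opposite of INT4's uniform grid.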

QLoRA introduces a second innovation on top of NF4: double quantization. Each group of 64 weights shares a quantization constant (scale) stored as a 32-bit float — that's 0.5 bits per weight of overhead. Double quantization quantizes these scale constants themselves to 8 bits (with a second-level 32-bit constant per block of 256 scales), reducing the overhead to ~0.127 bits per weight. For a 65B model, this saves about 3 GB.
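The overhead arithmetic, worked out (group size 64, second-level blocks of 256 scales, as in QLoRA):

```python
# Per-weight metadata overhead of the quantization constants, in bits.
naive = 32 / 64                       # one fp32 scale per 64 weights
double = 8 / 64 + 32 / (64 * 256)     # 8-bit scales + fp32 per 256 scales
saved_gb = (naive - double) * 65e9 / 8 / 1e9   # bits -> GB for a 65B model
print(naive, round(double, 3), round(saved_gb, 1))  # 0.5 0.127 3.0
```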

Why NF4 Is Better Than INT4 for Weights — But Not for Activations

Trained model weights follow a Gaussian distribution because SGD with L2 regularization (weight decay) actively pushes them toward zero. NF4's quantile-based levels perfectly match this. But activations don't follow a Gaussian — after ReLU they're clipped positive, after attention softmax they're in [0,1], after LayerNorm they can be any shape. NF4 is specifically designed for weight-only quantization; INT8/FP8 remains the right choice for activations.

07
SmoothQuant, FP8 & GGUF
The rest of the quantization zoo — and when each makes sense

SmoothQuant (Xiao et al., 2022) solves a different problem: activations are harder to quantize than weights because their distribution changes per-input (you can't run calibration on every new prompt). The insight: shift the quantization difficulty from activations to weights, which are static and easier to calibrate.

SmoothQuant — Per-Channel Smoothing
Original: Y = (X · W) — X has outliers, hard to quantize
SmoothQuant: Y = (X / s) · (s · W) where s = smoothing scale

s_j = max(|X[:,j]|)^α / max(|W[j,:]|)^(1−α) ← per input channel

α = 0.5 → equal difficulty split between X and W
α → 1.0 → push all difficulty to weights (better for activation quant)
Result: both X/s and s·W are easier to quantize to INT8
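The trade is easy to verify numerically — the rescaled product is mathematically identical, but the activation outlier shrinks. A sketch with an injected outlier channel (layer sizes are arbitrary):

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing scale for Y = X @ W.
    X: (n, d_in), W: (d_in, d_out)."""
    sx = np.abs(X).max(axis=0)               # per-channel activation max
    sw = np.abs(W).max(axis=1)               # per-channel weight max
    return (sx ** alpha) / np.maximum(sw, 1e-8) ** (1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X[:, 3] *= 50                                # activation outlier channel
W = rng.normal(size=(16, 32))

s = smooth_scales(X, W)
Y0 = X @ W                                   # original
Y1 = (X / s) @ (s[:, None] * W)              # smoothed — same output
```

After smoothing, `np.abs(X / s).max()` is far below `np.abs(X).max()`: the outlier's dynamic range has been shifted into the (static, easy-to-calibrate) weights.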

NVIDIA H100 GPUs introduced native FP8 tensor cores. FP8 is a floating-point format (not integer) with two variants: E4M3 (4 exponent bits, 3 mantissa, range ±448) and E5M2 (5 exponent bits, 2 mantissa, range ±57344). Unlike INT8 which requires dequantization to do matrix multiply, FP8 runs natively — meaning actual speedup, not just memory savings.
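Both ranges fall straight out of the floating-point formula, given each format's bias and reserved encodings (per the OCP FP8 spec):

```python
# Largest finite magnitude = (1 + mantissa/2^m) * 2^(exponent - bias).
# E4M3 (bias 7): the all-ones exponent+mantissa code is NaN, so the max
# finite value uses exponent field 15 with mantissa 110:
e4m3_max = (1 + 6 / 8) * 2 ** (15 - 7)      # 448.0
# E5M2 (bias 15): the all-ones exponent is inf/NaN, so max exponent is 30:
e5m2_max = (1 + 3 / 4) * 2 ** (30 - 15)     # 57344.0
print(e4m3_max, e5m2_max)
```

E4M3's tighter range but extra mantissa bit is why it suits weights and activations, while E5M2's wider range suits gradients.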

FP8 Training at Scale

Meta serves Llama 3.1 405B with FP8 inference, and frontier labs increasingly use FP8 for parts of training. The pairing of E4M3 for weights and activations with E5M2 for gradients is becoming a standard recipe. FP8 inference on H100 achieves ~2× throughput vs FP16 with near-identical accuracy — the next generation of fast inference will be FP8, not INT4.

GGUF (GPT-Generated Unified Format) is llama.cpp's model format — it bundles weights, tokenizer, and quantization metadata into a single file. It supports a range of quantization types labeled Q2_K through Q8_0, where the number is the nominal bit-width and the suffix indicates the quantization scheme (_K for k-quants) and size variant (_S/_M/_L).

| GGUF Type | Bits/Weight | 70B Model Size | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~24 GB | ⚠️ Noticeable loss | Ultra-low memory only |
| Q4_K_S | ~4.4 bits | ~38 GB | ✅ Good | Standard 4-bit |
| Q4_K_M | ~4.8 bits | ~42 GB | ✅ Very good | Recommended default |
| Q5_K_M | ~5.7 bits | ~50 GB | ✅✅ Excellent | Quality-conscious deployment |
| Q6_K | ~6.6 bits | ~57 GB | ✅✅ Near FP16 | Maximum quality INT |
| Q8_0 | 8 bits | ~70 GB | ✅✅✅ FP16 quality | Reference / highest quality |
📦

K-quants in GGUF: The _K suffix means "k-quants" — a mixed-precision scheme where important layers (embeddings, attention output, first/last transformer layers) are quantized at higher precision than less important layers. Q4_K_M uses Q6_K for ~10% of layers and Q4_K for the rest. This mixed approach is why K-quants consistently outperform non-K variants at the same average bit-width.

Every Method Side by Side

Accuracy Retention vs Memory Savings — 70B Model

Bubble area ∝ inference throughput improvement vs FP16. Top-right = ideal (high accuracy, high compression). No method reaches both extremes.

| Method | Precision | Memory vs FP16 | Accuracy Loss | Speed vs FP16 | Calibration | Best For |
|---|---|---|---|---|---|---|
| FP16 Baseline | 16-bit float | 1.0× | 0% | 1.0× | None | Max accuracy, ample VRAM |
| LLM.int8() | INT8 + FP16 | 2.0× | ~0% | ~1.0× (memory bound) | None | Safe drop-in, large models |
| SmoothQuant | INT8 W+A | 2.0× | <0.5% | 1.5–1.8× (INT8 matmul) | Small | High throughput inference |
| GPTQ | INT4 | 3.5–4.0× | 0.5–1.5% | 1.5–2.0× | Yes (calibration set) | Language models, perplexity-sensitive |
| AWQ | INT4 | 3.5–4.0× | 0.5–1.5% | 1.5–2.0× | Yes (smaller) | Instruction-tuned, multi-modal |
| NF4 (QLoRA) | NF4 4-bit | 4.0×+ | ~1% | 1.5× (dequant on-the-fly) | None | Training + inference on consumer GPU |
| GGUF Q4_K_M | ~4.8-bit mixed | 3.3× | <1% | 1.5–2.5× (CPU + GPU) | None (pre-computed) | Local deployment, llama.cpp |
| FP8 (H100) | 8-bit float | 2.0× | ~0% | 2.0× (hardware native) | Small | Production serving on H100 |

From Neural Compression to NF4

1993
Optimal Brain Surgeon (OBS)
Hassibi & Stork use Hessian-based weight removal for neural network compression. The theoretical foundation GPTQ would build on 30 years later.
OBSHessian pruning
2017
Quantization-Aware Training at Google
Google's mobile team formalizes QAT for TFLite, making INT8 quantization practical for on-device vision models. Straight-through estimator becomes standard.
QATTFLite
Aug 2022
LLM.int8() — bitsandbytes
Dettmers solves the outlier problem with mixed-precision decomposition. First method to quantize 175B+ models to INT8 with near-zero accuracy loss. Integrated into HuggingFace.
LLM.int8()bitsandbytes
Nov 2022
GPTQ — One-Shot 4-bit
Frantar et al. adapt OBS to quantize OPT-175B to INT4 in 4 hours on one GPU. GPTQ becomes the dominant INT4 method for open-weight models.
GPTQINT4
May 2023
QLoRA & NF4
Dettmers introduces NF4 as an information-theoretically optimal 4-bit format and combines it with LoRA fine-tuning. Fine-tuning 65B on a single GPU becomes reality.
NF4QLoRADouble Quant
Jun 2023
AWQ — Activation-Aware Scaling
Lin et al. show that scaling salient channels before quantization outperforms GPTQ on instruction following tasks. AWQ becomes default for multi-modal models.
AWQ
2024–25
FP8 Goes Production
H100 FP8 inference deployed at Meta, Google, and OpenAI. 2× throughput over FP16 with near-zero accuracy loss. The future of high-performance serving.
FP8H100

Which Method For Which Situation

💻

Local / Consumer GPU

GGUF Q4_K_M via llama.cpp for inference. NF4 via bitsandbytes if you're also fine-tuning. Both work on 24 GB VRAM for 70B models split across CPU+GPU.

🖥️

Single A100 Serving

GPTQ or AWQ INT4 for maximum model size per GPU. SmoothQuant INT8 if you need better accuracy. LLM.int8() as the safe no-calibration option.

High-Throughput Production

FP8 on H100 is the state of the art — 2× throughput, near-zero accuracy cost. SmoothQuant INT8 on A100/A10 if FP8 hardware isn't available.

📱

On-Device / Mobile

QAT with INT8 (CoreML / ONNX) for maximum accuracy. GGUF Q4_K_S for llama.cpp on phone. Target 1–3B models — quantization helps but model size is the real constraint.

🔧

Fine-Tuning on a Budget

QLoRA with NF4 — the only practical way to fine-tune 13B+ models on consumer hardware. Base model in NF4, LoRA adapters in BF16. bitsandbytes handles everything.

🔬

My take: The quantization landscape is converging on two tiers. For production at scale, FP8 on H100/H200 will dominate within 2 years — it's hardware-native, lossless, and fast. For local deployment and fine-tuning, 4-bit methods (AWQ, GGUF Q4_K_M, NF4) have hit a quality plateau where further research yields diminishing returns. The interesting frontier is pushing to 2-bit quantization without catastrophic loss — QuIP#, AQLM, and BitNet b1.58 are early results suggesting it's possible, but not yet practical at scale.