Deep Dive · Google Research · ICLR 2026

TurboQuant

Google just solved the KV cache memory bottleneck, the single biggest cost driver in LLM inference: 6× compression, 8× speedup, near-zero accuracy loss, no retraining. Here's the algorithm in full detail.

March 2026 · Prateek Singh, PhD
KV Cache Quantization · PolarQuant · QJL · Vector Search

Every token you generate stores 16-bit vectors.
They add up fast.

During transformer inference, key and value projections for every token in the context must be stored so attention can be computed without recomputation. This KV cache is your model's working memory. At 128K context, a 70B model's KV cache alone exceeds 140 GB of VRAM — more than the model weights themselves.
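The arithmetic behind these numbers is worth making concrete. A minimal sketch in Python, with a hypothetical head layout (real models vary in layer count, grouped-query head count, and batch size, so treat every parameter as illustrative):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value, batch=1):
    """KV cache size: 2 tensors (K and V) per layer, one head_dim-vector
    per KV head per token, at the given precision."""
    total_bits = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits_per_value
    return total_bits / 8 / 2**30  # GiB

# Hypothetical 70B-class config: 80 layers, 64 heads of dim 128, 128K context
fp16 = kv_cache_gib(80, 64, 128, 128 * 1024, bits_per_value=16)
q4 = kv_cache_gib(80, 64, 128, 128 * 1024, bits_per_value=4)
```

With full multi-head attention these settings give a cache in the hundreds of GiB; grouped-query attention divides n_kv_heads by the group size, which is how production 70B models land near the ~140 GB the text cites. Either way the growth is linear in context length: doubling seq_len doubles the cache.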

Previous attempts at KV quantization suffered from a hidden problem: quantization constants — the scale factors and normalisation values needed to dequantize — had to be stored in full precision alongside the compressed data, adding 1–2 extra bits per value and partially defeating the compression. TurboQuant eliminates this overhead entirely through a two-stage mathematical pipeline.

📉

6× Memory Reduction

FP16 KV cache compressed to 3–4 bits with near-zero accuracy loss. For a 70B model at 128K context, ~140 GB shrinks to roughly 26–35 GB.

8× Attention Speedup

4-bit TurboQuant achieves up to 8× faster attention logit computation vs 32-bit keys on H100 GPUs in JAX benchmarks.

🎯

No Training Needed

Data-oblivious and training-free. No calibration set, no fine-tuning, no model-specific tuning. Works on any transformer architecture.

📐

Near Information-Theoretic Optimal

TurboQuant operates near the known theoretical lower bound for quantization distortion — not just a heuristic, a provably near-optimal method.

💡

The Pied Piper reaction: The paper dropped March 24, 2026. Within 24 hours, memory-chip stocks fell sharply, with Micron and other NAND manufacturers selling off at the open. Cloudflare CEO Matthew Prince called it "Google's DeepSeek moment," and TechCrunch noted that the internet immediately dubbed it Pied Piper (after HBO's Silicon Valley). The reaction was proportionate: this closes in on the information-theoretic limit for KV cache compression, and the remaining room to optimise is narrow.

01
Why Traditional KV Quantization Was Leaky
The hidden cost that made conventional methods less effective than their headline numbers
+1–2 overhead bits · Per-block constants · FP32 scale storage

Standard vector quantization maps continuous values to a smaller discrete set. For INT4 quantization, you round each float16 value to the nearest of 16 integer levels. Simple in theory. But in practice, quantizing a vector of floats requires knowing the scale and zero-point of that vector — how to map integers back to floats when you need them.

These constants must be stored alongside the compressed data, in full precision (FP32). For block-wise quantization with block size 64, this adds 32/64 = 0.5 bits per value at FP32. For block size 16, it's 2 bits per value — partially undoing the compression you just performed. This overhead has been the fundamental ceiling on how aggressively you can compress the KV cache.

Traditional Block-Wise KV Quantization — Effective Bit Rate
Target: compress KV vectors to b bits per element
But per-block scale requires FP32 (32 bits) stored per block of size g:

effective_bits = b + 32 / g ← scale overhead per element

g = 64: effective_bits = 4 + 0.50 = 4.5 bits (target was 4.0)
g = 16: effective_bits = 4 + 2.00 = 6.0 bits (target was 4.0 — 50% overhead!)

TurboQuant overhead: 0 bits (zero quantization constants to store)
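The overhead arithmetic above is one line of code. A quick sketch (the function name is mine):

```python
def effective_bits(b, group_size, scale_bits=32):
    """Effective bits per element for block-wise quantization that stores
    one full-precision scale (scale_bits) per block of group_size elements."""
    return b + scale_bits / group_size

assert effective_bits(4, 64) == 4.5   # 12.5% over the 4-bit target
assert effective_bits(4, 16) == 6.0   # 50% over the 4-bit target
```

Shrinking the block lets the scale fit each block's range better, but inflates the per-element overhead; that tension is exactly what a precomputed, constant-free quantizer removes.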

TurboQuant's key architectural insight: if you can transform the data so its distribution is known in advance, you don't need to store any quantization constants at all. The quantizer can be precomputed once offline and applied universally. This is exactly what PolarQuant achieves through random rotation.

KV Cache Memory Growth vs Context Length

KV cache memory grows linearly with context length and model size — it's the dominant memory cost in long-context inference. TurboQuant 4-bit compresses the entire shaded region by 6×.

02
The TurboQuant Pipeline
Two-stage compression — PolarQuant for the bulk, QJL for the residual
Stage 1: PolarQuant · Stage 2: QJL (1 bit) · Zero overhead · Training-free

TurboQuant works by splitting the compression budget into two stages. The first stage (PolarQuant) handles the bulk of compression, typically b−1 bits, capturing the dominant structure of each vector with near-zero overhead. The second stage (QJL) uses the remaining 1 bit to correct the residual error left by the first stage. Together they achieve near-optimal rate-distortion with zero memory overhead.

TurboQuant — Full Pipeline for one KV vector
Input KV vector x ∈ ℝ^d (FP16, d=128 per head)
↓ Stage 1: PolarQuant
Random orthogonal rotation Π·x
Pair coordinates → polar form (r, θ)
Lloyd-Max quantize θ using (b−1) bits
↳ Quantization levels precomputed offline — zero constants stored at runtime
↓ Residual error
Residual ε = x − x̂ (what PolarQuant missed)
QJL: project onto random sign vector → 1 bit
↳ Eliminates bias in residual — zero extra memory overhead for 1-bit sign
↓ Output
Compressed KV vector: (b−1) bits from PolarQuant + 1 bit from QJL = b bits total
Total overhead: 0 bits of quantization constants stored

The key design decision is how to split the budget: PolarQuant takes b−1 bits because it captures the majority of the vector's information — the angles in polar coordinates follow a known, compressible distribution. QJL takes exactly 1 bit because even a single sign bit of the residual, processed correctly through the JL estimator, eliminates the quantization bias that would otherwise accumulate across the attention computation.
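Putting the two stages together, the encode path is short. The sketch below is my own reading of the pipeline, not the paper's implementation: Π and g are precomputed offline, the angle grid is uniform, and the radius is computed only to form the residual, not stored:

```python
import numpy as np

def turboquant_encode(x, Pi, g, b):
    """Two-stage encode of one KV vector: (b-1)-bit angle codes + 1 sign bit."""
    x_rot = Pi @ x                                        # known-distribution coords
    pairs = x_rot.reshape(-1, 2)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    levels = 2 ** (b - 1)
    step = 2 * np.pi / levels
    codes = np.round(theta / step).astype(int) % levels   # Stage 1: PolarQuant
    # Reconstruct stage-1 output to measure what it missed
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    theta_hat = codes * step
    x_hat = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1).ravel()
    sign_bit = np.sign(g @ (x_rot - x_hat))               # Stage 2: QJL residual bit
    return codes, sign_bit

rng = np.random.default_rng(0)
d = 128
Pi, _ = np.linalg.qr(rng.normal(size=(d, d)))   # offline, reused for all vectors
g = rng.choice([-1.0, 1.0], size=d)             # offline random sign vector
codes, s = turboquant_encode(rng.normal(size=d), Pi, g, b=4)
```

At b=4 this yields 64 three-bit angle codes plus one sign bit per 128-dim vector, with nothing else written to memory.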

TurboQuant — Visual Architecture Flowchart

Left path: PolarQuant compresses the bulk of each KV vector using (b−1) bits with zero runtime overhead. Right path: QJL contributes 1 residual-correction bit. Both paths combine to produce a b-bit compressed vector.

03
PolarQuant
Random rotation + polar coordinates = a distribution you can precompute a quantizer for
AISTATS 2026 · Cartesian → Polar · Lloyd-Max optimal · Zero runtime overhead

The core problem with quantizing arbitrary vectors is that their values can be distributed any way — skewed, heavy-tailed, clustered. A quantizer optimal for one distribution is terrible for another. PolarQuant sidesteps this by first applying a random orthogonal rotation to the vector.

Random Rotation — Why It Works
Let Π be a random orthogonal matrix (d × d), pre-generated offline
Apply rotation to input vector x:
x̃ = Π · x ← rotated vector, same norm, spread uniformly

Key property (Johnson-Lindenstrauss): after random rotation,
each coordinate x̃_i follows approximately N(0, ‖x‖²/d)
→ distribution is now KNOWN and PREDICTABLE regardless of input

Use QR decomposition to build Π (once offline, reuse forever):
G ~ N(0, I_{d×d});   Π, _ = QR(G)

After rotation, every coordinate follows approximately the same Gaussian distribution regardless of the original vector's structure. This is the crucial enabler: you can now design a quantizer offline, once, for this known distribution, and apply it to every vector at runtime without storing any per-vector constants.
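The offline construction above is a couple of numpy calls. A sketch following the QR-of-Gaussian recipe in the box:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128

# Offline, once: QR-decompose a Gaussian matrix to get an orthogonal rotation
Pi, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Any input, however skewed, spreads out after rotation. Use a heavy,
# one-sided distribution as a deliberately awkward test case:
x = rng.exponential(scale=3.0, size=d)
x_rot = Pi @ x

# Rotation preserves the norm exactly; the rotated coordinates now look
# like draws from N(0, ||x||^2 / d), the known distribution the text cites.
assert np.isclose(np.linalg.norm(x_rot), np.linalg.norm(x))
```

Because Π is fixed, it costs one matrix multiply per vector at runtime and zero bytes of per-vector storage.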

Rather than quantizing coordinates independently (Cartesian quantization), PolarQuant converts pairs of rotated coordinates (x̃₂ᵢ, x̃₂ᵢ₊₁) into polar form. The key insight: the angular component θ has a much more concentrated, predictable distribution than the raw Cartesian values — making it far more compressible.

PolarQuant — Polar Coordinate Quantization
For each pair of rotated coordinates (x̃₂ᵢ, x̃₂ᵢ₊₁):
r_i = √(x̃₂ᵢ² + x̃₂ᵢ₊₁²) ← radius (magnitude of pair)
θ_i = atan2(x̃₂ᵢ₊₁, x̃₂ᵢ) ← angle in [−π, π]

After random rotation: θ_i is uniformly distributed on [−π, π]
→ optimal quantizer for uniform distribution = uniform grid!

Quantize angle with (b−1) bits:
θ̂_i = round(θ_i · 2^(b−1) / (2π)) · (2π / 2^(b−1))

Reconstruct: x̂₂ᵢ = r_i·cos(θ̂_i), x̂₂ᵢ₊₁ = r_i·sin(θ̂_i)
Note: radius r_i is NOT quantized; after rotation it follows a known (Rayleigh) distribution, so no per-vector constant needs storing
📐

Why the radius doesn't need to be stored: attention computes softmax(Q·Kᵀ/√d), and softmax is invariant to adding a constant to every logit; a shared scale on the keys acts only as a mild temperature adjustment. More importantly, after random rotation each coordinate pair is approximately isotropic Gaussian, so the radius r_i follows a known Rayleigh distribution with scale ‖x‖/√d: its typical magnitude is predictable offline rather than a per-vector unknown. PolarQuant therefore stores only the angle, which determines the direction of the vector. This is the core trick that eliminates all storage overhead.
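To see how cheap angle-only storage is, here is a small experiment (my own sketch, using the uniform angle grid from the box above) measuring relative reconstruction error as the angle bit budget grows. A standard Gaussian input stands in for a rotated vector, since it is already isotropic:

```python
import numpy as np

def quantize_angles(x_rot, bits):
    """Encode pair-angles on a uniform 2^bits grid; reconstruct with the
    true radius (which TurboQuant does not store) to isolate angle error."""
    pairs = x_rot.reshape(-1, 2)
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    step = 2 * np.pi / 2 ** bits
    theta_hat = np.round(theta / step) * step
    return np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1).ravel()

rng = np.random.default_rng(0)
x = rng.normal(size=128)   # isotropic input, standing in for Π·x
errs = {b: np.linalg.norm(x - quantize_angles(x, b)) / np.linalg.norm(x)
        for b in (2, 3, 4, 5)}
```

Each extra bit roughly halves the relative error, the familiar 6 dB-per-bit behaviour of a uniform quantizer.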

Cartesian vs Polar Representation — Why Polar Wins

Left: raw coordinates clustered unevenly — quantizing them requires per-vector scale constants. Right: after random rotation and polar conversion, angle θ is uniformly distributed — optimal quantizer computable offline once, reused for all vectors.

For a uniform angular distribution (which PolarQuant guarantees after rotation), the optimal quantizer is simply a uniform grid — equally spaced levels across [−π, π]. This can be computed analytically once and hardcoded. For non-uniform distributions (such as the Beta distribution that arises in some head configurations), PolarQuant uses the Lloyd-Max algorithm to find the optimal level placements. Either way, the quantizer is computed offline and never changes — zero runtime overhead, optimal distortion.
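Lloyd-Max is just 1-D k-means: alternately assign samples to their nearest level and move each level to the centroid of its cell. A minimal sketch (initialisation and iteration count are my choices):

```python
import numpy as np

def lloyd_max(samples, n_levels, iters=30):
    """Fit MSE-optimal quantization levels to an empirical 1-D distribution."""
    # Initialise at evenly spaced quantiles of the data
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        # Assignment step: nearest level for every sample
        idx = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
        # Update step: move each level to the centroid of its cell
        for k in range(n_levels):
            cell = samples[idx == k]
            if cell.size:
                levels[k] = cell.mean()
    return np.sort(levels)

rng = np.random.default_rng(0)
uniform_levels = lloyd_max(rng.uniform(-np.pi, np.pi, 200_000), 8)
```

On uniform angles the fitted levels land on an evenly spaced grid, matching the claim that the uniform case can simply be hardcoded; the iteration only earns its keep on non-uniform distributions such as the Beta case mentioned above.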

PolarQuant — Compression at Different Bit Widths

Angle quantization error vs bit width. At 4 bits the angular error is small enough to be near-lossless in downstream attention quality.

04
QJL — Quantized Johnson-Lindenstrauss
One bit of residual correction that eliminates quantization bias entirely
AAAI 2025 · 1 bit per vector · Bias-free estimator · JL Transform

After PolarQuant compresses the vector using b−1 bits, there is a small residual error ε = x − x̂. If ignored, this error introduces a systematic bias into the attention score — every dot product Q·K is shifted by a small but consistent amount that accumulates across all attention heads and all tokens, degrading model quality.

QJL eliminates this bias with a single additional bit per vector. The insight comes from the Johnson-Lindenstrauss lemma: a random projection preserves inner products in expectation. You don't need to store the full residual — just its sign along a random direction.

QJL — Quantized Johnson-Lindenstrauss Transform
Residual after PolarQuant:
ε = x − x̂ ← small but introduces bias in Q·K

QJL: project residual onto random sign vector g ∈ {±1}^d:
b_ε = sign(gᵀ · ε) ← just 1 bit stored per KV vector

At query time, bias-corrected attention score estimate:
Q·K ≈ Q·x̂ + c · b_ε · (gᵀ · Q) ← corrected dot product
(c = √(π/2)·E[‖ε‖], a constant known offline from PolarQuant's distortion)

Why it works (exact for Gaussian g; near-exact for sign g at large d):
E[b_ε · (gᵀ·Q)] = E[sign(gᵀ·ε)·(gᵀ·Q)]
= √(2/π)·‖Q‖·cos∠(ε, Q) = √(2/π)·(εᵀ·Q)/‖ε‖
→ scaling by c makes the correction an (approximately) unbiased estimate of ε·Q

Memory cost: 1 bit per KV vector (vs 16·d bits for the full FP16 residual)
🔬

The estimator trick explained plainly: You want to know ε·Q, the residual's inner product with the query, without storing ε; you store only the sign of ε's projection along a random direction g. The JL estimator says that if you know that sign and the query's component along the same direction, you can build an unbiased estimate of the original inner product. The estimate has variance but zero mean error, and over the softmax in attention that variance is harmless: softmax is robust to small symmetric noise. The bias left by PolarQuant, however, would not cancel; it would shift all attention scores in one direction. QJL's sole purpose is to eliminate that systematic drift.

Why 1 Bit Is Enough — And Not More

The residual ε after PolarQuant is already small — PolarQuant captures the dominant structure of the vector. What remains is high-frequency, near-random noise. The JL lemma tells us that for random noise, a single bit of sign information along a random direction provides the maximum information gain per bit — additional bits on the residual have strongly diminishing returns. This is why TurboQuant's optimal split is always (b−1) bits for PolarQuant + 1 bit for QJL, regardless of the total budget b.
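The zero-mean-error claim is easy to check numerically. Below is a Monte Carlo sketch of the sign-projection estimator using Gaussian directions, for which the identity E[sign(gᵀε)·(gᵀQ)] = √(2/π)·‖Q‖·cos∠(ε, Q) is exact (±1 sign vectors behave the same way at large d); all variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Hypothetical residual and query (unit norm keeps the numbers readable)
eps = rng.normal(size=d); eps /= np.linalg.norm(eps)
Q = rng.normal(size=d);   Q /= np.linalg.norm(Q)

# Store only sign(g . eps) for many random directions g, then average
# the 1-bit correction term sign(g . eps) * (g . Q)
G = rng.normal(size=(200_000, d))
correction = np.mean(np.sign(G @ eps) * (G @ Q))

# Closed form: sqrt(2/pi) * ||Q|| * cos(angle(eps, Q)); both unit norm here
theory = np.sqrt(2 / np.pi) * (eps @ Q)
```

The averaged 1-bit correction matches the closed form, so a known rescaling recovers ε·Q in expectation. Any single draw is noisy, but the noise is zero-mean, which is precisely the property the attention softmax tolerates.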

QJL — 1-Bit Residual Correction Mechanism

PolarQuant leaves a small residual error ε. QJL projects ε onto a random sign vector g (stored offline) and records only the sign — 1 bit. At query time, this single bit corrects the bias in the dot product Q·K without storing any additional data per vector.

TurboQuant — Animated Full Pipeline

A KV vector enters at FP16. PolarQuant captures direction with (b−1) bits. QJL adds 1 residual correction bit. Output: b-bit compressed vector with zero stored constants.

05
The Numbers
What TurboQuant actually achieves on Gemma, Mistral, and vector search
H100 GPU benchmarks · JAX baseline · Gemma · Mistral
Attention Logit Speedup vs FP32 Baseline — H100 GPU

Measured on NVIDIA H100 GPUs in JAX. TurboQuant 4-bit achieves 8× speedup vs FP32 unquantized keys. Even 3-bit delivers 6×. Compared against a highly optimised JAX FP32 baseline.

KV Cache Memory — 70B Model at 128K Context

At 128K tokens, FP16 KV cache requires ~140 GB. TurboQuant 4-bit reduces this to ~35 GB — fitting on a single 80 GB A100. At 3-bit, under 27 GB — fits on a single H100.

Accuracy was evaluated on Gemma and Mistral using LongBench (multi-task long-context understanding) and the "needle in a haystack" retrieval task, where a key fact is hidden deep in a long context and the model must retrieve it exactly. TurboQuant at 4-bit was near-lossless on both. At 3-bit, quality degrades noticeably on models smaller than 7B but remains within acceptable bounds on 7B+ models.

Key Sensitivity: Values vs Keys

Community experiments have found an important asymmetry: value quantization is the bottleneck, not key quantization. At 2-bit values, cosine similarity degrades to ~0.94 — noticeable quality loss. At 4-bit values, cosine similarity is 0.997 — essentially FP16 quality. Keys are less sensitive because they interact through the softmax, which is robust to small perturbations. Practical recommendation: use 4-bit for values, 3-bit for keys if budget is tight.
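The value-sensitivity cliff can be reproduced with even a naive per-vector scalar quantizer. This is not TurboQuant, just a baseline sketch showing why 2-bit values hurt:

```python
import numpy as np

def quantize_uniform(v, bits):
    """Naive symmetric uniform quantizer with a per-vector scale (sketch)."""
    scale = np.abs(v).max()
    levels = 2 ** bits
    step = 2 * scale / levels
    q = np.clip(np.round(v / step), -(levels // 2), levels // 2 - 1)
    return q * step

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v = rng.normal(size=4096)                  # stand-in for a value vector
cos4 = cos_sim(v, quantize_uniform(v, 4))
cos2 = cos_sim(v, quantize_uniform(v, 2))
```

Even this crude scheme shows the gap: cosine similarity stays near 1.0 at 4 bits and drops visibly at 2 bits. The community numbers quoted above come from stronger quantizers, but the shape of the degradation is the same.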

Beyond LLM inference, TurboQuant was evaluated as a general-purpose vector search algorithm on the GloVe dataset (d=200, 1.2M vectors). It achieved superior recall ratios across top-k retrieval tasks compared to both Product Quantization (PQ) and RaBitQ — without requiring large codebooks, dataset-specific tuning, or any training data. This matters because vector search underpins Google Search, YouTube recommendations, and advertising targeting.
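On the vector-search side, the headline metric is recall: how many of the true top-k inner-product neighbours survive quantization. A brute-force evaluation sketch at toy scale (GloVe itself is 1.2M vectors at d=200; the names and the noise stand-in below are mine):

```python
import numpy as np

def recall_at_k(base, base_q, queries, k=10):
    """Overlap between the exact top-k neighbours and the top-k found on
    quantized base vectors, averaged over queries (brute-force inner product)."""
    hits = 0
    for q in queries:
        exact = set(np.argsort(base @ q)[-k:])
        approx = set(np.argsort(base_q @ q)[-k:])
        hits += len(exact & approx)
    return hits / (k * len(queries))

rng = np.random.default_rng(0)
base = rng.normal(size=(5000, 64))
queries = rng.normal(size=(20, 64))
# Stand-in for a quantizer: a small perturbation of the stored vectors
r = recall_at_k(base, base + 0.01 * rng.normal(size=base.shape), queries)
```

Plugging any quantizer's reconstructed vectors in for base_q gives its recall curve; the perturbation here just exercises the harness.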

TurboQuant vs the Field

Five categories of competing approaches — each with different tradeoffs. TurboQuant is the first to combine zero overhead, training-free operation, and near-optimal rate-distortion in one method.

| Method | Bits | Overhead | Training? | Vector Search? | Accuracy | Speedup |
|---|---|---|---|---|---|---|
| FP16 (baseline) | 16 | 0 | No | – | Reference | 1× |
| vLLM FP8 KV | 8 | Low | No | No | Near-lossless | ~1.5× |
| GPTQ (weight quant) | 4 | Group scale (FP16) | Calibration | No | ~0.5–1% loss | Memory only |
| KVQuant | 4 | Per-vector scale | Calibration | No | Good on 7B+ | 2–3× |
| Ollama q4_0/q8_0 | 4–8 | Block scale | No | No | Unpredictable | Memory only |
| Product Quant. (PQ) | Variable | Large codebook | Required | Yes | Good | 3–4× |
| RaBitQ | 1 | Norm storage | No | Yes | Lossy at low bits | 5–6× |
| TurboQuant (4-bit) | 4 | Zero | No | Yes | Near-lossless | 8× |
| TurboQuant (3-bit) | 3 | Zero | No | Yes | Good on 7B+ | 6× |
🏆

Zero Overhead

The only 4-bit KV method that stores literally zero quantization constants. Every competing method stores at least 0.5 extra bits per element for scales or codebooks. TurboQuant: 0.

🔌

Works for Both KV + Vector Search

FP8, GPTQ, KVQuant — all LLM-specific. TurboQuant is a general vector quantization algorithm that also beats PQ and RaBitQ on the GloVe benchmark. One method, two massive use cases.

📐

Near Information-Theoretic Optimal

TurboQuant operates near the Shannon lower bound for quantization distortion. Most methods are heuristics. This is a theorem — provably near-optimal rate-distortion for the given bit budget.

Hardware-Friendly

The polar coordinate representation maps naturally to efficient bitwise operations. 4-bit attention logits on H100 achieve 8× speedup — not just memory savings but actual compute acceleration.

Small Models (Under 3B) Need Care

At 3-bit, models below 3B parameters show noticeable output degradation — repetitive or incoherent text. Smaller models have less redundancy in their KV representations, making them more sensitive to quantization noise. Community experiments confirm 4-bit is safe down to ~1B; 3-bit should only be used on 7B+ models. This is not a fundamental limitation of TurboQuant's algorithm — it reflects the information content of small model activations.

No Official Implementation Yet — Community is Moving Fast

As of March 2026, there is no official Google open-source release. The paper provides theory and pseudocode. However: community integration into llama.cpp is tracked in issue #20969; MLX experiments on Apple Silicon report ~5× compression with 99.5% quality retention; vLLM has an open feature request for native TurboQuant KV quantization. Official code is expected Q2 2026.

🔬

My take: TurboQuant is the most theoretically rigorous KV cache compression method published to date. The combination of zero overhead, training-free operation, and near-optimal rate-distortion is genuinely novel. What makes it important is not just that it compresses more — it's that it establishes where the limit is. KV cache compression is approaching its information-theoretic boundary. The next major efficiency gains in LLM inference will have to come from somewhere else: attention architecture changes (like Mamba hybrid models), memory hierarchy design, or fundamentally different ways of representing context. TurboQuant doesn't just solve a problem — it closes the chapter on this particular problem.