Deep Dive · Google Research · ICLR 2026

TurboQuant

Google just solved the KV cache memory bottleneck, the single biggest cost driver in LLM inference: 6× compression, 8× speedup, near-zero accuracy loss, no retraining. Here's the algorithm in full detail.

March 2026 · Prateek Singh, PhD
KV Cache Quantization · PolarQuant · QJL · Vector Search

Every token you generate stores 16-bit vectors.
They add up fast.

During transformer inference, key and value projections for every token in the context must be stored so attention can be computed without recomputation. This KV cache is your model's working memory. At 128K context, a 70B model's KV cache alone exceeds 140 GB of VRAM — more than the model weights themselves.
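The arithmetic behind these numbers is worth making concrete. A minimal sketch in Python, with a hypothetical head layout (real models vary in layer count, grouped-query head count, and batch size, so treat every parameter as illustrative):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value, batch=1):
    """KV cache size: 2 tensors (K and V) per layer, one head_dim-vector
    per KV head per token, at the given precision."""
    total_bits = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits_per_value
    return total_bits / 8 / 2**30  # GiB

# Hypothetical 70B-class config: 80 layers, 64 heads of dim 128, 128K context
fp16 = kv_cache_gib(80, 64, 128, 128 * 1024, bits_per_value=16)
q4 = kv_cache_gib(80, 64, 128, 128 * 1024, bits_per_value=4)
```

With full multi-head attention these settings give a cache in the hundreds of GiB; grouped-query attention divides n_kv_heads by the group size, which is how production 70B models land near the ~140 GB the text cites. Either way the growth is linear in context length: doubling seq_len doubles the cache.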

Previous attempts at KV quantization suffered from a hidden problem: quantization constants — the scale factors and normalisation values needed to dequantize — had to be stored in full precision alongside the compressed data, adding 1–2 extra bits per value and partially defeating the compression. TurboQuant eliminates this overhead entirely through a two-stage mathematical pipeline.

📉

6× Memory Reduction

FP16 KV cache compressed to 3–4 bits with near-zero accuracy loss. For a 70B model at 128K context, ~140 GB shrinks to roughly 26–35 GB.

8× Attention Speedup

4-bit TurboQuant achieves up to 8× faster attention logit computation vs 32-bit keys on H100 GPUs in JAX benchmarks.

🎯

No Training Needed

Data-oblivious and training-free. No calibration set, no fine-tuning, no model-specific tuning. Works on any transformer architecture.

📐

Near Information-Theoretic Optimal

TurboQuant operates near the known theoretical lower bound for quantization distortion — not just a heuristic, a provably near-optimal method.

💡

The Pied Piper reaction: The paper dropped March 24, 2026. Within 24 hours, memory-chip stocks fell sharply, with Micron and other NAND manufacturers selling off at the open. Cloudflare CEO Matthew Prince called it "Google's DeepSeek moment," and TechCrunch noted that the internet immediately dubbed it Pied Piper (after HBO's Silicon Valley). The reaction was proportionate: this closes in on the information-theoretic limit for KV cache compression, and the remaining room to optimise is narrow.

01
Why Traditional KV Quantization Was Leaky
The hidden cost that made conventional methods less effective than their headline numbers
+1–2 overhead bits · Per-block constants · FP32 scale storage

Standard vector quantization maps continuous values to a smaller discrete set. For INT4 quantization, you round each float16 value to the nearest of 16 integer levels. Simple in theory. But in practice, quantizing a vector of floats requires knowing the scale and zero-point of that vector — how to map integers back to floats when you need them.

These constants must be stored alongside the compressed data, in full precision (FP32). For block-wise quantization with block size 64, this adds 32/64 = 0.5 bits per value at FP32. For block size 16, it's 2 bits per value — partially undoing the compression you just performed. This overhead has been the fundamental ceiling on how aggressively you can compress the KV cache.

Traditional Block-Wise KV Quantization — Effective Bit Rate
Target: compress KV vectors to b bits per element
But per-block scale requires FP32 (32 bits) stored per block of size g:

effective_bits = b + 32 / g ← scale overhead per element

g = 64: effective_bits = 4 + 0.50 = 4.5 bits (target was 4.0)
g = 16: effective_bits = 4 + 2.00 = 6.0 bits (target was 4.0 — 50% overhead!)

TurboQuant overhead: 0 bits (zero quantization constants to store)
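The overhead arithmetic above is one line of code. A quick sketch (the function name is mine):

```python
def effective_bits(b, group_size, scale_bits=32):
    """Effective bits per element for block-wise quantization that stores
    one full-precision scale (scale_bits) per block of group_size elements."""
    return b + scale_bits / group_size

assert effective_bits(4, 64) == 4.5   # 12.5% over the 4-bit target
assert effective_bits(4, 16) == 6.0   # 50% over the 4-bit target
```

Shrinking the block lets the scale fit each block's range better, but inflates the per-element overhead; that tension is exactly what a precomputed, constant-free quantizer removes.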

TurboQuant's key architectural insight: if you can transform the data so its distribution is known in advance, you don't need to store any quantization constants at all. The quantizer can be precomputed once offline and applied universally. This is exactly what PolarQuant achieves through random rotation.

KV Cache Memory Growth vs Context Length

KV cache memory grows linearly with context length and model size — it's the dominant memory cost in long-context inference. TurboQuant 4-bit compresses the entire shaded region by 6×.

02
The TurboQuant Pipeline
Two-stage compression — PolarQuant for the bulk, QJL for the residual
Stage 1: PolarQuant · Stage 2: QJL (1 bit) · Zero overhead · Training-free

TurboQuant works by splitting the compression budget into two stages. The first stage (PolarQuant) handles the bulk of compression, typically b−1 bits, capturing the dominant structure of each vector with near-zero overhead. The second stage (QJL) uses the remaining 1 bit to correct the residual error left by the first stage. Together they achieve near-optimal rate-distortion with zero memory overhead.

TurboQuant — Full Pipeline for one KV vector
Input KV vector x ∈ ℝ^d (FP16, d=128 per head)
↓ Stage 1: PolarQuant
Random orthogonal rotation Π·x
Pair coordinates → polar form (r, θ)
Lloyd-Max quantize θ using (b−1) bits
↳ Quantization levels precomputed offline — zero constants stored at runtime
↓ Residual error
Residual ε = x − x̂ (what PolarQuant missed)
QJL: project onto random sign vector → 1 bit
↳ Eliminates bias in residual — zero extra memory overhead for 1-bit sign
↓ Output
Compressed KV vector: (b−1) bits from PolarQuant + 1 bit from QJL = b bits total
Total overhead: 0 bits of quantization constants stored

The key design decision is how to split the budget: PolarQuant takes b−1 bits because it captures the majority of the vector's information — the angles in polar coordinates follow a known, compressible distribution. QJL takes exactly 1 bit because even a single sign bit of the residual, processed correctly through the JL estimator, eliminates the quantization bias that would otherwise accumulate across the attention computation.
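Putting the two stages together, the encode path is short. The sketch below is my own reading of the pipeline, not the paper's implementation: Π and g are precomputed offline, the angle grid is uniform, and the radius is computed only to form the residual, not stored:

```python
import numpy as np

def turboquant_encode(x, Pi, g, b):
    """Two-stage encode of one KV vector: (b-1)-bit angle codes + 1 sign bit."""
    x_rot = Pi @ x                                        # known-distribution coords
    pairs = x_rot.reshape(-1, 2)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    levels = 2 ** (b - 1)
    step = 2 * np.pi / levels
    codes = np.round(theta / step).astype(int) % levels   # Stage 1: PolarQuant
    # Reconstruct stage-1 output to measure what it missed
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    theta_hat = codes * step
    x_hat = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1).ravel()
    sign_bit = np.sign(g @ (x_rot - x_hat))               # Stage 2: QJL residual bit
    return codes, sign_bit

rng = np.random.default_rng(0)
d = 128
Pi, _ = np.linalg.qr(rng.normal(size=(d, d)))   # offline, reused for all vectors
g = rng.choice([-1.0, 1.0], size=d)             # offline random sign vector
codes, s = turboquant_encode(rng.normal(size=d), Pi, g, b=4)
```

At b=4 this yields 64 three-bit angle codes plus one sign bit per 128-dim vector, with nothing else written to memory.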

TurboQuant — Visual Architecture Flowchart

Left path: PolarQuant compresses the bulk of each KV vector using (b−1) bits with zero runtime overhead. Right path: QJL contributes 1 residual-correction bit. Both paths combine to produce a b-bit compressed vector.

03
PolarQuant
Random rotation + polar coordinates = a distribution you can precompute a quantizer for
AISTATS 2026 · Cartesian → Polar · Lloyd-Max optimal · Zero runtime overhead

The core problem with quantizing arbitrary vectors is that their values can be distributed any way — skewed, heavy-tailed, clustered. A quantizer optimal for one distribution is terrible for another. PolarQuant sidesteps this by first applying a random orthogonal rotation to the vector.

Random Rotation — Why It Works
Let Π be a random orthogonal matrix (d × d), pre-generated offline
Apply rotation to input vector x:
x̃ = Π · x ← rotated vector, same norm, spread uniformly

Key property (Johnson-Lindenstrauss): after random rotation,
each coordinate x̃_i follows approximately N(0, ‖x‖²/d)
→ distribution is now KNOWN and PREDICTABLE regardless of input

Use QR decomposition to build Π (once offline, reuse forever):
G ~ N(0, I_{d×d});   Π, _ = QR(G)

After rotation, every coordinate follows approximately the same Gaussian distribution regardless of the original vector's structure. This is the crucial enabler: you can now design a quantizer offline, once, for this known distribution, and apply it to every vector at runtime without storing any per-vector constants.
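The offline construction above is a couple of numpy calls. A sketch following the QR-of-Gaussian recipe in the box:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128

# Offline, once: QR-decompose a Gaussian matrix to get an orthogonal rotation
Pi, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Any input, however skewed, spreads out after rotation. Use a heavy,
# one-sided distribution as a deliberately awkward test case:
x = rng.exponential(scale=3.0, size=d)
x_rot = Pi @ x

# Rotation preserves the norm exactly; the rotated coordinates now look
# like draws from N(0, ||x||^2 / d), the known distribution the text cites.
assert np.isclose(np.linalg.norm(x_rot), np.linalg.norm(x))
```

Because Π is fixed, it costs one matrix multiply per vector at runtime and zero bytes of per-vector storage.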

Rather than quantizing coordinates independently (Cartesian quantization), PolarQuant converts pairs of rotated coordinates (x̃₂ᵢ, x̃₂ᵢ₊₁) into polar form. The key insight: the angular component θ has a much more concentrated, predictable distribution than the raw Cartesian values — making it far more compressible.

PolarQuant — Polar Coordinate Quantization
For each pair of rotated coordinates (x̃₂ᵢ, x̃₂ᵢ₊₁):
r_i = √(x̃₂ᵢ² + x̃₂ᵢ₊₁²) ← radius (magnitude of pair)
θ_i = atan2(x̃₂ᵢ₊₁, x̃₂ᵢ) ← angle in [−π, π]

After random rotation: θ_i is uniformly distributed on [−π, π]
→ optimal quantizer for uniform distribution = uniform grid!

Quantize angle with (b−1) bits:
θ̂_i = round(θ_i · 2^(b−1) / (2π)) · (2π / 2^(b−1))

Reconstruct: x̂₂ᵢ = r_i·cos(θ̂_i), x̂₂ᵢ₊₁ = r_i·sin(θ̂_i)
Note: radius r_i is NOT quantized; after rotation it follows a known (Rayleigh) distribution, so no per-vector constant needs storing
📐

Why the radius doesn't need to be stored: attention computes softmax(Q·Kᵀ/√d), and softmax is invariant to adding a constant to every logit; a shared scale on the keys acts only as a mild temperature adjustment. More importantly, after random rotation each coordinate pair is approximately isotropic Gaussian, so the radius r_i follows a known Rayleigh distribution with scale ‖x‖/√d: its typical magnitude is predictable offline rather than a per-vector unknown. PolarQuant therefore stores only the angle, which determines the direction of the vector. This is the core trick that eliminates all storage overhead.
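To see how cheap angle-only storage is, here is a small experiment (my own sketch, using the uniform angle grid from the box above) measuring relative reconstruction error as the angle bit budget grows. A standard Gaussian input stands in for a rotated vector, since it is already isotropic:

```python
import numpy as np

def quantize_angles(x_rot, bits):
    """Encode pair-angles on a uniform 2^bits grid; reconstruct with the
    true radius (which TurboQuant does not store) to isolate angle error."""
    pairs = x_rot.reshape(-1, 2)
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    step = 2 * np.pi / 2 ** bits
    theta_hat = np.round(theta / step) * step
    return np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1).ravel()

rng = np.random.default_rng(0)
x = rng.normal(size=128)   # isotropic input, standing in for Π·x
errs = {b: np.linalg.norm(x - quantize_angles(x, b)) / np.linalg.norm(x)
        for b in (2, 3, 4, 5)}
```

Each extra bit roughly halves the relative error, the familiar 6 dB-per-bit behaviour of a uniform quantizer.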

Cartesian vs Polar Representation — Why Polar Wins

Left: raw coordinates clustered unevenly — quantizing them requires per-vector scale constants. Right: after random rotation and polar conversion, angle θ is uniformly distributed — optimal quantizer computable offline once, reused for all vectors.

For a uniform angular distribution (which PolarQuant guarantees after rotation), the optimal quantizer is simply a uniform grid — equally spaced levels across [−π, π]. This can be computed analytically once and hardcoded. For non-uniform distributions (such as the Beta distribution that arises in some head configurations), PolarQuant uses the Lloyd-Max algorithm to find the optimal level placements. Either way, the quantizer is computed offline and never changes — zero runtime overhead, optimal distortion.
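Lloyd-Max is just 1-D k-means: alternately assign samples to their nearest level and move each level to the centroid of its cell. A minimal sketch (initialisation and iteration count are my choices):

```python
import numpy as np

def lloyd_max(samples, n_levels, iters=30):
    """Fit MSE-optimal quantization levels to an empirical 1-D distribution."""
    # Initialise at evenly spaced quantiles of the data
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        # Assignment step: nearest level for every sample
        idx = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
        # Update step: move each level to the centroid of its cell
        for k in range(n_levels):
            cell = samples[idx == k]
            if cell.size:
                levels[k] = cell.mean()
    return np.sort(levels)

rng = np.random.default_rng(0)
uniform_levels = lloyd_max(rng.uniform(-np.pi, np.pi, 200_000), 8)
```

On uniform angles the fitted levels land on an evenly spaced grid, matching the claim that the uniform case can simply be hardcoded; the iteration only earns its keep on non-uniform distributions such as the Beta case mentioned above.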

PolarQuant — Compression at Different Bit Widths

Angle quantization error vs bit width. At 4 bits the angular error is small enough to be near-lossless in downstream attention quality.

04
QJL — Quantized Johnson-Lindenstrauss
One bit of residual correction that eliminates quantization bias entirely
AAAI 2025 · 1 bit per vector · Bias-free estimator · JL Transform

After PolarQuant compresses the vector using b−1 bits, there is a small residual error ε = x − x̂. If ignored, this error introduces a systematic bias into the attention score — every dot product Q·K is shifted by a small but consistent amount that accumulates across all attention heads and all tokens, degrading model quality.

QJL eliminates this bias with a single additional bit per vector. The insight comes from the Johnson-Lindenstrauss lemma: a random projection preserves inner products in expectation. You don't need to store the full residual — just its sign along a random direction.

QJL — Quantized Johnson-Lindenstrauss Transform
Residual after PolarQuant:
ε = x − x̂ ← small but introduces bias in Q·K

QJL: project residual onto random sign vector g ∈ {±1}^d:
b_ε = sign(gᵀ · ε) ← just 1 bit stored per KV vector

At query time, bias-corrected attention score estimate:
Q·K ≈ Q·x̂ + c · b_ε · (gᵀ · Q) ← corrected dot product
(c = √(π/2)·E[‖ε‖], a constant known offline from PolarQuant's distortion)

Why it works (exact for Gaussian g; near-exact for sign g at large d):
E[b_ε · (gᵀ·Q)] = E[sign(gᵀ·ε)·(gᵀ·Q)]
= √(2/π)·‖Q‖·cos∠(ε, Q) = √(2/π)·(εᵀ·Q)/‖ε‖
→ scaling by c makes the correction an (approximately) unbiased estimate of ε·Q

Memory cost: 1 bit per KV vector (vs 16·d bits for the full FP16 residual)
🔬

The estimator trick explained plainly: You want to know ε·Q, the residual's inner product with the query, without storing ε; you store only the sign of ε's projection along a random direction g. The JL estimator says that if you know that sign and the query's component along the same direction, you can build an unbiased estimate of the original inner product. The estimate has variance but zero mean error, and over the softmax in attention that variance is harmless: softmax is robust to small symmetric noise. The bias left by PolarQuant, however, would not cancel; it would shift all attention scores in one direction. QJL's sole purpose is to eliminate that systematic drift.

Why 1 Bit Is Enough — And Not More

The residual ε after PolarQuant is already small — PolarQuant captures the dominant structure of the vector. What remains is high-frequency, near-random noise. The JL lemma tells us that for random noise, a single bit of sign information along a random direction provides the maximum information gain per bit — additional bits on the residual have strongly diminishing returns. This is why TurboQuant's optimal split is always (b−1) bits for PolarQuant + 1 bit for QJL, regardless of the total budget b.
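The zero-mean-error claim is easy to check numerically. Below is a Monte Carlo sketch of the sign-projection estimator using Gaussian directions, for which the identity E[sign(gᵀε)·(gᵀQ)] = √(2/π)·‖Q‖·cos∠(ε, Q) is exact (±1 sign vectors behave the same way at large d); all variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Hypothetical residual and query (unit norm keeps the numbers readable)
eps = rng.normal(size=d); eps /= np.linalg.norm(eps)
Q = rng.normal(size=d);   Q /= np.linalg.norm(Q)

# Store only sign(g . eps) for many random directions g, then average
# the 1-bit correction term sign(g . eps) * (g . Q)
G = rng.normal(size=(200_000, d))
correction = np.mean(np.sign(G @ eps) * (G @ Q))

# Closed form: sqrt(2/pi) * ||Q|| * cos(angle(eps, Q)); both unit norm here
theory = np.sqrt(2 / np.pi) * (eps @ Q)
```

The averaged 1-bit correction matches the closed form, so a known rescaling recovers ε·Q in expectation. Any single draw is noisy, but the noise is zero-mean, which is precisely the property the attention softmax tolerates.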

QJL — 1-Bit Residual Correction Mechanism

PolarQuant leaves a small residual error ε. QJL projects ε onto a random sign vector g (stored offline) and records only the sign — 1 bit. At query time, this single bit corrects the bias in the dot product Q·K without storing any additional data per vector.

TurboQuant — Animated Full Pipeline

A KV vector enters at FP16. PolarQuant captures direction with (b−1) bits. QJL adds 1 residual correction bit. Output: b-bit compressed vector with zero stored constants.

05
The Numbers
What TurboQuant actually achieves on Gemma, Mistral, and vector search
H100 GPU benchmarks · JAX baseline · Gemma · Mistral
Attention Logit Speedup vs FP32 Baseline — H100 GPU

Measured on NVIDIA H100 GPUs in JAX. TurboQuant 4-bit achieves 8× speedup vs FP32 unquantized keys. Even 3-bit delivers 6×. Compared against a highly optimised JAX FP32 baseline.

KV Cache Memory — 70B Model at 128K Context

At 128K tokens, FP16 KV cache requires ~140 GB. TurboQuant 4-bit reduces this to ~35 GB — fitting on a single 80 GB A100. At 3-bit, under 27 GB — fits on a single H100.

Accuracy was evaluated on Gemma and Mistral using LongBench (multi-task long-context understanding) and the "needle in a haystack" retrieval task, where a key fact is hidden deep in a long context and the model must retrieve it exactly. TurboQuant at 4-bit was near-lossless on both. At 3-bit, quality degrades noticeably on models smaller than 7B but remains within acceptable bounds on 7B+ models.

Key Sensitivity: Values vs Keys

Community experiments have found an important asymmetry: value quantization is the bottleneck, not key quantization. At 2-bit values, cosine similarity degrades to ~0.94 — noticeable quality loss. At 4-bit values, cosine similarity is 0.997 — essentially FP16 quality. Keys are less sensitive because they interact through the softmax, which is robust to small perturbations. Practical recommendation: use 4-bit for values, 3-bit for keys if budget is tight.
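The value-sensitivity cliff can be reproduced with even a naive per-vector scalar quantizer. This is not TurboQuant, just a baseline sketch showing why 2-bit values hurt:

```python
import numpy as np

def quantize_uniform(v, bits):
    """Naive symmetric uniform quantizer with a per-vector scale (sketch)."""
    scale = np.abs(v).max()
    levels = 2 ** bits
    step = 2 * scale / levels
    q = np.clip(np.round(v / step), -(levels // 2), levels // 2 - 1)
    return q * step

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v = rng.normal(size=4096)                  # stand-in for a value vector
cos4 = cos_sim(v, quantize_uniform(v, 4))
cos2 = cos_sim(v, quantize_uniform(v, 2))
```

Even this crude scheme shows the gap: cosine similarity stays near 1.0 at 4 bits and drops visibly at 2 bits. The community numbers quoted above come from stronger quantizers, but the shape of the degradation is the same.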

Beyond LLM inference, TurboQuant was evaluated as a general-purpose vector search algorithm on the GloVe dataset (d=200, 1.2M vectors). It achieved superior recall ratios across top-k retrieval tasks compared to both Product Quantization (PQ) and RaBitQ — without requiring large codebooks, dataset-specific tuning, or any training data. This matters because vector search underpins Google Search, YouTube recommendations, and advertising targeting.
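On the vector-search side, the headline metric is recall: how many of the true top-k inner-product neighbours survive quantization. A brute-force evaluation sketch at toy scale (GloVe itself is 1.2M vectors at d=200; the names and the noise stand-in below are mine):

```python
import numpy as np

def recall_at_k(base, base_q, queries, k=10):
    """Overlap between the exact top-k neighbours and the top-k found on
    quantized base vectors, averaged over queries (brute-force inner product)."""
    hits = 0
    for q in queries:
        exact = set(np.argsort(base @ q)[-k:])
        approx = set(np.argsort(base_q @ q)[-k:])
        hits += len(exact & approx)
    return hits / (k * len(queries))

rng = np.random.default_rng(0)
base = rng.normal(size=(5000, 64))
queries = rng.normal(size=(20, 64))
# Stand-in for a quantizer: a small perturbation of the stored vectors
r = recall_at_k(base, base + 0.01 * rng.normal(size=base.shape), queries)
```

Plugging any quantizer's reconstructed vectors in for base_q gives its recall curve; the perturbation here just exercises the harness.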

TurboQuant vs the Field

Five categories of competing approaches — each with different tradeoffs. TurboQuant is the first to combine zero overhead, training-free operation, and near-optimal rate-distortion in one method.

| Method | Bits | Overhead | Training? | Vector Search? | Accuracy | Speedup |
|---|---|---|---|---|---|---|
| FP16 (baseline) | 16 | 0 | No | – | Reference | 1× |
| vLLM FP8 KV | 8 | Low | No | No | Near-lossless | ~1.5× |
| GPTQ (weight quant) | 4 | Group scale (FP16) | Calibration | No | ~0.5–1% loss | Memory only |
| KVQuant | 4 | Per-vector scale | Calibration | No | Good on 7B+ | 2–3× |
| Ollama q4_0/q8_0 | 4–8 | Block scale | No | No | Unpredictable | Memory only |
| Product Quant. (PQ) | Variable | Large codebook | Required | Yes | Good | 3–4× |
| RaBitQ | 1 | Norm storage | No | Yes | Lossy at low bits | 5–6× |
| TurboQuant (4-bit) | 4 | Zero | No | Yes | Near-lossless | 8× |
| TurboQuant (3-bit) | 3 | Zero | No | Yes | Good on 7B+ | 6× |
🏆

Zero Overhead

The only 4-bit KV method that stores literally zero quantization constants. Every competing method stores at least 0.5 extra bits per element for scales or codebooks. TurboQuant: 0.

🔌

Works for Both KV + Vector Search

FP8, GPTQ, KVQuant — all LLM-specific. TurboQuant is a general vector quantization algorithm that also beats PQ and RaBitQ on the GloVe benchmark. One method, two massive use cases.

📐

Near Information-Theoretic Optimal

TurboQuant operates near the Shannon lower bound for quantization distortion. Most methods are heuristics. This is a theorem — provably near-optimal rate-distortion for the given bit budget.

Hardware-Friendly

The polar coordinate representation maps naturally to efficient bitwise operations. 4-bit attention logits on H100 achieve 8× speedup — not just memory savings but actual compute acceleration.

Small Models (Under 3B) Need Care

At 3-bit, models below 3B parameters show noticeable output degradation — repetitive or incoherent text. Smaller models have less redundancy in their KV representations, making them more sensitive to quantization noise. Community experiments confirm 4-bit is safe down to ~1B; 3-bit should only be used on 7B+ models. This is not a fundamental limitation of TurboQuant's algorithm — it reflects the information content of small model activations.

No Official Implementation Yet — Community is Moving Fast

As of March 2026, there is no official Google open-source release. The paper provides theory and pseudocode. However: community integration into llama.cpp is tracked in issue #20969; MLX experiments on Apple Silicon report ~5× compression with 99.5% quality retention; vLLM has an open feature request for native TurboQuant KV quantization. Official code is expected Q2 2026.

🔬

My take: TurboQuant is the most theoretically rigorous KV cache compression method published to date. The combination of zero overhead, training-free operation, and near-optimal rate-distortion is genuinely novel. What makes it important is not just that it compresses more — it's that it establishes where the limit is. KV cache compression is approaching its information-theoretic boundary. The next major efficiency gains in LLM inference will have to come from somewhere else: attention architecture changes (like Mamba hybrid models), memory hierarchy design, or fundamentally different ways of representing context. TurboQuant doesn't just solve a problem — it closes the chapter on this particular problem.