One Paper, $85 Billion Gone: Google TurboQuant and the Jevons Moment for Memory Chips
Google's TurboQuant algorithm compresses AI inference memory 6×, triggering an $85B global memory chip selloff. Wall Street invokes the Jevons Paradox to call it a buying opportunity. SharpPost dissects the technical reality and Google's HBM bargaining play.
6× compression claimed
$85B market cap erased in 2 days
-11% SanDisk drop leading US memory selloff
50-60% HBM capacity shortfall
Key Findings
Event: On March 24, Google Research published TurboQuant, an algorithm that compresses large language model KV cache from 16-bit to 3-bit precision during inference, achieving a 6× memory reduction and up to 8× throughput acceleration. The paper has been accepted at ICLR 2026.
Market reaction: US memory stocks plunged on March 25; Asian markets followed on March 26. SanDisk fell 11%, SK Hynix dropped 6.23%, Samsung lost 4.71%, and Micron declined 3.4% (nearly 20% over five days). Major memory chipmakers shed a combined ~$85 billion in market capitalization.
Core assessment: The market priced a narrow-scope paper as if it were a demand-destruction event. TurboQuant compresses only the KV cache during inference. It does not touch training workloads or model weights — and those two categories account for the bulk of HBM demand. This was an emotional selloff driven by a technical misread.
I. The Paper: 3-Bit Precision Magic
On March 24, Google Research published TurboQuant on its official blog — an extreme compression algorithm targeting the key-value cache (KV cache) used during large language model inference. The KV cache stores previously computed results so that the model does not have to reprocess the entire context window every time it generates a new token. TurboQuant compresses each KV cache value from the standard 16-bit representation down to 3 bits, delivering a 6× memory reduction and up to 8× inference throughput improvement on Nvidia H100 GPUs, while matching uncompressed accuracy on all benchmarks.
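A back-of-the-envelope sizing shows why the KV cache matters at long context lengths. The sketch below uses hypothetical Llama-class model dimensions (an assumption, not the paper's benchmark setup); note that the raw bit ratio of 16/3 is about 5.3×, so the headline 6× figure presumably reflects the paper's own measurement methodology.

```python
# Back-of-the-envelope KV cache sizing. Model dimensions are hypothetical
# Llama-class placeholders, not TurboQuant's benchmark configuration.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    # Two tensors per layer (K and V), one entry per token per head dimension.
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8

layers, kv_heads, head_dim = 80, 8, 128   # assumed model shape
seq_len = 128_000                          # assumed long context window

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 16)
int3 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 3)

print(f"16-bit KV cache: {fp16 / 2**30:5.1f} GiB")
print(f" 3-bit KV cache: {int3 / 2**30:5.1f} GiB ({fp16 / int3:.1f}x smaller)")
```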
The technical approach proceeds in two stages. PolarQuant first applies random rotations to data vectors, simplifying their geometric structure so that a standard quantizer can efficiently compress each dimension. A second pass uses QJL, a 1-bit algorithm, to correct residual errors and eliminate quantization bias in attention scores. The paper has been accepted at ICLR 2026, and the open-source community reproduced results quickly — a PyTorch implementation on GitHub reports 5× compression with 99.5% attention fidelity. Silicon Valley has taken to calling it "Google's DeepSeek moment."
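For intuition on the first stage, here is a toy NumPy sketch of the generic rotate-then-quantize technique: a random orthogonal rotation followed by uniform quantization. It is not Google's implementation and omits the QJL residual-correction pass entirely, but it shows why rotation helps a simple quantizer: outlier channels, common in LLM activations, get spread evenly across dimensions, tightening the value range the quantizer must cover.

```python
import numpy as np

# Toy demo of rotate-then-quantize. NOT TurboQuant's actual code; the data,
# dimensions, and outlier pattern below are synthetic illustrations.

rng = np.random.default_rng(0)
d, n, bits = 128, 1024, 3
levels = 2**bits  # 8 quantization levels at 3-bit precision

# Synthetic KV-like vectors with a few outlier channels.
x = rng.normal(size=(n, d))
x[:, :4] *= 20.0  # heavy outlier channels dominate the value range

# Random orthogonal rotation via QR of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def quantize(v):
    # Uniform quantization to `levels` levels, then dequantize.
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return np.round((v - lo) / step) * step + lo

plain = quantize(x)
rotated = quantize(x @ q) @ q.T  # rotate, quantize, rotate back

err = lambda y: np.linalg.norm(x - y) / np.linalg.norm(x)
print(f"relative error, no rotation:   {err(plain):.4f}")
print(f"relative error, with rotation: {err(rotated):.4f}")
```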
In other words, TurboQuant addresses a highly specific problem: how to store longer context windows with less memory during inference. It does not reduce the model's parameter count, it does not lower training-stage compute requirements, and it does not shrink the storage footprint of model weights. This distinction is critical, because the market panic was built squarely on ignoring it.
II. The Selloff: $85 Billion on a Misread
The day after the paper dropped, US memory stocks cratered. SanDisk plummeted 11.02%, leading the sector; Western Digital fell 4.7%, Seagate lost 2.76%, and Micron declined 3.4%. When Asian markets opened on March 26, the panic crossed the Pacific: SK Hynix dropped 6.23% and Samsung Electronics fell 4.71% in Seoul. Across both sessions, major memory chipmakers lost a combined ~$85 billion in market capitalization. The Nasdaq closed down 2.4% that day, dragged down in significant part by Meta and Micron.
| Company | Market | One-Day Decline | Note |
|---|---|---|---|
| SanDisk | US | -11.02% | Led sector; highest NAND flash exposure |
| SK Hynix | Korea | -6.23% | Core HBM supplier; market feared demand slowdown |
| Samsung Electronics | Korea | -4.71% | Dual DRAM + NAND exposure |
| Western Digital | US | -4.70% | Heavy storage and data-center revenue mix |
| Micron Technology | US | -3.40% | Down nearly 20% over five trading days |
| Seagate Technology | US | -2.76% | Primarily HDD; limited AI storage exposure |
The selloff thesis was straightforward: if AI inference requires 6× less memory, chip demand must be headed for a cliff. The intuition holds at a surface level, but the technical details tell a different story. South Korea's Seoul Economic Daily cited semiconductor analysts estimating TurboQuant's real-world compression at roughly 2.6×, not the headline 6× — because the paper's figures assume ideal laboratory conditions, and production deployment inevitably discounts the ratio. More fundamentally, KV cache accounts for only 15% to 25% of total inference memory; model weights dominate the remainder. Even a 6× KV cache compression translates to roughly 20% total inference memory savings — nowhere near the "demand destruction" that drove the selloff narrative.
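The arithmetic behind that estimate is easy to reproduce. A minimal sketch, using the KV-share and compression figures cited above:

```python
# Total inference-memory savings from KV cache compression alone.
# KV-share and compression figures are the article's cited estimates.

for kv_share in (0.15, 0.25):       # KV cache share of inference memory
    for ratio in (2.6, 6.0):        # conservative estimate vs headline claim
        total = (1 - kv_share) + kv_share / ratio
        print(f"KV share {kv_share:.0%}, {ratio}x compression "
              f"-> total memory {total:.1%} of baseline "
              f"({1 - total:.1%} saved)")
```

Even on the most generous combination (25% KV share at the full 6× ratio), total savings top out near 21%.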
III. The Counterargument: Jevons Paradox and Real Demand Structure
Wall Street's response was nearly unanimous in its bullishness. Morgan Stanley's Asia technology research head Shawn Kim was first to invoke the Jevons Paradox: when the efficiency of a resource improves, its unit cost falls, stimulating greater overall consumption so that total usage rises rather than declines. The 19th-century British economist William Stanley Jevons observed that improvements in steam engine efficiency did not reduce coal consumption — they made steam power cheap enough to industrialize entire economies. Kim argued TurboQuant follows the same logic: inference costs falling to one-sixth of current levels means models previously confined to expensive cloud clusters can now be deployed to edge devices, and application scenarios previously gated by cost will be unlocked. JPMorgan and Citi echoed similar assessments.
Viewed through this lens, TurboQuant's impact on memory demand decomposes into two dimensions. The first is the direct effect: per-inference KV cache memory consumption declines. That much is certain. The second is the indirect effect: lower inference costs catalyze more deployments, more users, and longer context windows — the Jevons zone, whose magnitude depends on the price elasticity of AI adoption. On the supply side, Samsung, SK Hynix, and Micron have already allocated 70% of new capacity to HBM, and the market still faces a 50% to 60% HBM capacity shortfall. Training-stage HBM demand is entirely untouched by TurboQuant — and training remains the core driver of HBM orders.
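The two effects can be netted out in one line. In the sketch below, the roughly 20% direct saving follows from the arithmetic in Section II, while the induced-usage growth figures are purely hypothetical placeholders; the price elasticity they stand in for is exactly the open question.

```python
# Toy net-effect model: direct efficiency saving vs Jevons-style induced
# demand. The 20% saving comes from the earlier arithmetic; the usage
# growth scenarios are hypothetical placeholders, not forecasts.

efficiency_gain = 0.20  # ~20% less memory per inference (direct effect)

for usage_growth in (0.0, 0.15, 0.50):  # hypothetical induced usage growth
    net = (1 - efficiency_gain) * (1 + usage_growth)
    verdict = "demand rises" if net > 1 else "demand falls"
    print(f"usage +{usage_growth:.0%}: net memory demand {net:.2f}x ({verdict})")
```

On these numbers, total memory demand rises only if induced usage grows more than 25%, the break-even implied by a 20% efficiency gain.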
The market, in short, repriced an entire sector on a change in one local variable. The pattern is not new. When DeepSeek published its efficiency breakthrough in late 2024, AI chip stocks suffered a sharp short-term drawdown in early 2025 before rebounding as sustained demand growth reasserted itself.
IV. The Hidden Dimension: Google's Bargaining Play
The market turbulence triggered by TurboQuant has one dimension that tends to be overlooked: why did Google choose this particular moment to release the paper publicly?
As one of the world's largest operators of AI inference infrastructure, Google spends tens of billions of dollars annually on HBM procurement. SK Hynix is its primary HBM supplier; Samsung is working to close the gap. In a market where HBM supply remains tight and prices elevated, Google publicly demonstrating "we can do the same work with less memory" is fundamentally a bargaining signal to its suppliers: your indispensability is not absolute.
The timing is telling. HBM4 is expected to enter mass production in the second half of 2026, and hyperscale operators — Google, Meta, Microsoft — are actively negotiating HBM4 pricing and allocation priority with memory manufacturers. Releasing a paper that reduces memory dependency at precisely this juncture, regardless of its engineering deployment timeline, gives the buy side measurable leverage at the negotiation table. On the surface, a technical publication. In substance, a procurement tactic.
V. Assessment
TurboQuant is an excellent piece of engineering. It achieves state-of-the-art results in the specific subfield of KV cache quantization. But it is not an event that reshapes memory chip supply-and-demand fundamentals. The $85 billion in erased market capitalization over two days did not price the paper's technical substance — it priced anxiety about "peak AI hardware demand," a narrative that has surfaced repeatedly since 2024 and been disproven each time.
The core demand drivers for memory chips — the compute arms race in AI training and the structural supply shortage of HBM — have not been shaken by an inference-side optimization paper. The $85 billion evaporation was not a technical verdict. It was fear, priced in. For investors, this looks closer to a buying opportunity than an exit signal — provided the Jevons Paradox still holds in the domain of AI inference. Two centuries of evidence, from the steam engine to cloud computing, suggest it almost always does.
If you've read this far, you care about what truly matters.
SharpPost delivers weekly in-depth analysis straight to your inbox: finance, geopolitics, technology, cutting through the surface. Zero ads, zero filler.