One Paper, $85 Billion Gone: Google TurboQuant and the Jevons Moment for Memory Chips
Google's TurboQuant algorithm compresses AI inference memory 6×, triggering an $85B global memory chip selloff. Wall Street invokes the Jevons Paradox to call it a buying opportunity. SharpPost dissects the technical reality and Google's HBM bargaining play.
6× compression claimed
$85B market cap erased in 2 days
-11% SanDisk drop leading US memory selloff
50-60% HBM capacity shortfall
Key Findings
Event: On March 24, Google Research published TurboQuant, an algorithm that compresses large language model KV cache from 16-bit to 3-bit precision during inference, achieving a 6× memory reduction and up to 8× throughput acceleration. The paper has been accepted at ICLR 2026.
Market reaction: US memory stocks plunged on March 25; Asian markets followed on March 26. SanDisk fell 11%, SK Hynix dropped 6.23%, Samsung lost 4.71%, and Micron declined 3.4% (nearly 20% over five days). Major memory chipmakers shed a combined ~$85 billion in market capitalization.
Core assessment: The market priced a narrow-scope paper as if it were a demand-destruction event. TurboQuant compresses only the KV cache during inference. It does not touch training workloads or model weights — and those two categories account for the bulk of HBM demand. This was an emotional selloff driven by a technical misread.
I. The Paper: 3-Bit Precision Magic
On March 24, Google Research published TurboQuant on its official blog — an extreme compression algorithm targeting the key-value cache (KV cache) used during large language model inference. The KV cache stores previously computed results so that the model does not have to reprocess the entire context window every time it generates a new token. TurboQuant compresses each KV cache value from the standard 16-bit representation down to 3 bits, delivering a 6× memory reduction and up to 8× inference throughput improvement on Nvidia H100 GPUs, while matching uncompressed accuracy on all benchmarks.
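A back-of-the-envelope sizing shows why the KV cache matters at long context lengths. The sketch below uses hypothetical Llama-class model dimensions (an assumption, not the paper's benchmark setup); note that the raw bit ratio of 16/3 is about 5.3×, so the headline 6× figure presumably reflects the paper's own measurement methodology.

```python
# Back-of-the-envelope KV cache sizing. Model dimensions are hypothetical
# Llama-class placeholders, not TurboQuant's benchmark configuration.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    # Two tensors per layer (K and V), one entry per token per head dimension.
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8

layers, kv_heads, head_dim = 80, 8, 128   # assumed model shape
seq_len = 128_000                          # assumed long context window

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 16)
int3 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 3)

print(f"16-bit KV cache: {fp16 / 2**30:5.1f} GiB")
print(f" 3-bit KV cache: {int3 / 2**30:5.1f} GiB ({fp16 / int3:.1f}x smaller)")
```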
The technical approach proceeds in two stages. PolarQuant first applies random rotations to data vectors, simplifying their geometric structure so that a standard quantizer can efficiently compress each dimension. A second pass uses QJL, a 1-bit algorithm, to correct residual errors and eliminate quantization bias in attention scores. The paper has been accepted at ICLR 2026, and the open-source community reproduced results quickly — a PyTorch implementation on GitHub reports 5× compression with 99.5% attention fidelity. Silicon Valley has taken to calling it "Google's DeepSeek moment."
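For intuition on the first stage, here is a toy NumPy sketch of the generic rotate-then-quantize technique: a random orthogonal rotation followed by uniform quantization. It is not Google's implementation and omits the QJL residual-correction pass entirely, but it shows why rotation helps a simple quantizer: outlier channels, common in LLM activations, get spread evenly across dimensions, tightening the value range the quantizer must cover.

```python
import numpy as np

# Toy demo of rotate-then-quantize. NOT TurboQuant's actual code; the data,
# dimensions, and outlier pattern below are synthetic illustrations.

rng = np.random.default_rng(0)
d, n, bits = 128, 1024, 3
levels = 2**bits  # 8 quantization levels at 3-bit precision

# Synthetic KV-like vectors with a few outlier channels.
x = rng.normal(size=(n, d))
x[:, :4] *= 20.0  # heavy outlier channels dominate the value range

# Random orthogonal rotation via QR of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def quantize(v):
    # Uniform quantization to `levels` levels, then dequantize.
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return np.round((v - lo) / step) * step + lo

plain = quantize(x)
rotated = quantize(x @ q) @ q.T  # rotate, quantize, rotate back

err = lambda y: np.linalg.norm(x - y) / np.linalg.norm(x)
print(f"relative error, no rotation:   {err(plain):.4f}")
print(f"relative error, with rotation: {err(rotated):.4f}")
```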
In other words, TurboQuant addresses a highly specific problem: how to store longer context windows with less memory during inference. It does not reduce the model's parameter count, it does not lower training-stage compute requirements, and it does not shrink the storage footprint of model weights. This distinction is critical, because the market panic was built squarely on ignoring it.
II. The Selloff: $85 Billion on a Misread
The day after the paper dropped, US memory stocks cratered. SanDisk plummeted 11.02%, leading the sector; Western Digital fell 4.7%, Seagate lost 2.76%, and Micron declined 3.4%. When Asian markets opened on March 26, the panic crossed the Pacific: SK Hynix dropped 6.23% and Samsung Electronics fell 4.71% in Seoul. Across both sessions, major memory chipmakers lost a combined ~$85 billion in market capitalization. The Nasdaq closed down 2.4% that day, dragged down in significant part by Meta and Micron.
| Company | Market | One-Day Decline | Note |
|---|---|---|---|
| SanDisk | US | -11.02% | Led sector; highest NAND flash exposure |
| SK Hynix | Korea | -6.23% | Core HBM supplier; market feared demand slowdown |
| Samsung Electronics | Korea | -4.71% | Dual DRAM + NAND exposure |
| Western Digital | US | -4.70% | Heavy storage and data-center revenue mix |
| Micron Technology | US | -3.40% | Down nearly 20% over five trading days |
| Seagate Technology | US | -2.76% | Primarily HDD; limited AI storage exposure |
The selloff thesis was straightforward: if AI inference requires 6× less memory, chip demand must be headed for a cliff. The intuition holds at a surface level, but the technical details tell a different story. South Korea's Seoul Economic Daily cited semiconductor analysts estimating TurboQuant's real-world compression at roughly 2.6×, not the headline 6× — because the paper's figures assume ideal laboratory conditions, and production deployment inevitably discounts the ratio. More fundamentally, KV cache accounts for only 15% to 25% of total inference memory; model weights dominate the remainder. Even a 6× KV cache compression translates to roughly 20% total inference memory savings — nowhere near the "demand destruction" that drove the selloff narrative.
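The arithmetic behind that estimate is easy to reproduce. A minimal sketch, using the KV-share and compression figures cited above:

```python
# Total inference-memory savings from KV cache compression alone.
# KV-share and compression figures are the article's cited estimates.

for kv_share in (0.15, 0.25):       # KV cache share of inference memory
    for ratio in (2.6, 6.0):        # conservative estimate vs headline claim
        total = (1 - kv_share) + kv_share / ratio
        print(f"KV share {kv_share:.0%}, {ratio}x compression "
              f"-> total memory {total:.1%} of baseline "
              f"({1 - total:.1%} saved)")
```

Even on the most generous combination (25% KV share at the full 6× ratio), total savings top out near 21%.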
III. The Counterargument: Jevons Paradox and Real Demand Structure
Wall Street's response was nearly unanimous in its bullishness. Morgan Stanley's Asia technology research head Shawn Kim was first to invoke the Jevons Paradox: when the efficiency of a resource improves, its unit cost falls, stimulating greater overall consumption so that total usage rises rather than declines. The 19th-century British economist William Stanley Jevons observed that improvements in steam engine efficiency did not reduce coal consumption — they made steam power cheap enough to industrialize entire economies. Kim argued TurboQuant follows the same logic: inference costs falling to one-sixth of current levels means models previously confined to expensive cloud clusters can now be deployed to edge devices, and application scenarios previously gated by cost will be unlocked. JPMorgan and Citi echoed similar assessments.
Viewed through this lens, TurboQuant's impact on memory demand decomposes into two dimensions. The first is the direct effect: per-inference KV cache memory consumption declines. That much is certain. The second is the indirect effect: lower inference costs catalyze more deployments, more users, and longer context windows — the Jevons zone, whose magnitude depends on the price elasticity of AI adoption. On the supply side, Samsung, SK Hynix, and Micron have already allocated 70% of new capacity to HBM, and the market still faces a 50% to 60% HBM capacity shortfall. Training-stage HBM demand is entirely untouched by TurboQuant — and training remains the core driver of HBM orders.
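The two effects can be netted out in one line. In the sketch below, the roughly 20% direct saving follows from the arithmetic in Section II, while the induced-usage growth figures are purely hypothetical placeholders; the price elasticity they stand in for is exactly the open question.

```python
# Toy net-effect model: direct efficiency saving vs Jevons-style induced
# demand. The 20% saving comes from the earlier arithmetic; the usage
# growth scenarios are hypothetical placeholders, not forecasts.

efficiency_gain = 0.20  # ~20% less memory per inference (direct effect)

for usage_growth in (0.0, 0.15, 0.50):  # hypothetical induced usage growth
    net = (1 - efficiency_gain) * (1 + usage_growth)
    verdict = "demand rises" if net > 1 else "demand falls"
    print(f"usage +{usage_growth:.0%}: net memory demand {net:.2f}x ({verdict})")
```

On these numbers, total memory demand rises only if induced usage grows more than 25%, the break-even implied by a 20% efficiency gain.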
The market, in short, repriced an entire sector on a change in one local variable. The pattern is not new. When DeepSeek published its efficiency breakthrough in late 2024, AI chip stocks suffered a sharp short-term drawdown in early 2025 before rebounding as sustained demand growth reasserted itself.
IV. The Hidden Dimension: Google's Bargaining Play
The market turbulence triggered by TurboQuant has one dimension that tends to be overlooked: why did Google choose this particular moment to release the paper publicly?
As one of the world's largest operators of AI inference infrastructure, Google spends tens of billions of dollars annually on HBM procurement. SK Hynix is its primary HBM supplier; Samsung is working to close the gap. In a market where HBM supply remains tight and prices elevated, Google publicly demonstrating "we can do the same work with less memory" is fundamentally a bargaining signal to its suppliers: your indispensability is not absolute.
The timing is telling. HBM4 is expected to enter mass production in the second half of 2026, and hyperscale operators — Google, Meta, Microsoft — are actively negotiating HBM4 pricing and allocation priority with memory manufacturers. Releasing a paper that reduces memory dependency at precisely this juncture, regardless of its engineering deployment timeline, gives the buy side measurable leverage at the negotiation table. On the surface, a technical publication. In substance, a procurement tactic.
V. Assessment
TurboQuant is an excellent piece of engineering. It achieves state-of-the-art results in the specific subfield of KV cache quantization. But it is not an event that reshapes memory chip supply-and-demand fundamentals. The $85 billion in erased market capitalization over two days did not price the paper's technical substance — it priced anxiety about "peak AI hardware demand," a narrative that has surfaced repeatedly since 2024 and been disproven each time.
The core demand drivers for memory chips — the compute arms race in AI training and the structural supply shortage of HBM — have not been shaken by an inference-side optimization paper. The $85 billion evaporation was not a technical verdict. It was fear, priced in. For investors, this looks closer to a buying opportunity than an exit signal — provided the Jevons Paradox still holds in the domain of AI inference. Two centuries of evidence, from the steam engine to cloud computing, suggest it almost always does.
If you've read this far, you care about what truly matters.
SharpPost delivers weekly in-depth analysis straight to your inbox: finance, geopolitics, technology, cutting through the surface. Zero ads, zero filler.