Why does context get expensive so fast?
Long context feels like memory. Under the hood it is a live working set that has to be read, routed, cached, and paid for.
Context becomes a binding constraint when the application wants more state than the model can read cheaply and at low latency. The constraint is memory movement, not the advertised size of the window.
Context is not free recall
Putting more text into a prompt does not give the model magical memory. It gives the model more material to process at inference time.
The system has to tokenize it, route it through attention, store or recompute cache state, and pay the latency and memory cost.
The KV cache is the hidden bill
During generation, the model keeps key and value states for every prior token so it can attend back to them without recomputing everything. That cache grows linearly with context length, number of layers, number of key-value heads (times the per-head dimension), and batch size.
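The scaling can be made concrete with a little arithmetic. The sketch below assumes a standard transformer that stores one key and one value vector per token, per layer, per key-value head. The model shape in the example (32 layers, 8 key-value heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for illustration, not a description of any particular model.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes for a standard transformer.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_value=2 assumes 16-bit storage (an assumption, not a given).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


# Hypothetical model shape, for illustration only.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
full_window = kv_cache_bytes(131_072, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_window / 2**30:.0f} GiB for one 131,072-token sequence")
```

Under these assumptions a single full-length sequence needs 16 GiB of cache before any weights are counted, and that figure multiplies again with batch size.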
At long context, the cache can dominate memory pressure. That is why serving long documents is a systems problem, not just a model feature.
The cache lives on a memory ladder
Every storage tier in a serving stack has a property called drain time: how long it would take to read the whole tier, end to end, at full bandwidth. HBM drains in about 20 milliseconds. DDR drains in seconds. Flash drains in roughly a minute. A spinning disk drains in roughly an hour. Each tier earns its place when its drain time roughly matches how long the data needs to stick around.
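Drain time is just capacity divided by bandwidth, and the matching rule can be sketched as a lookup. In the sketch below, the tier drain times are the rough figures from the paragraph above, except DDR, where the text only says "seconds" and 2.0 is an assumed placeholder. The HBM capacity and bandwidth in the example are assumed round numbers for one accelerator, and `closest_tier` is a hypothetical helper that picks the tier whose drain time is nearest the retention window on a log scale.

```python
import math

# Rough drain times in seconds. HBM, flash, and disk follow the text;
# the DDR value is an assumption (the text only says "seconds").
DRAIN_TIME_S = {"HBM": 0.02, "DDR": 2.0, "flash": 60.0, "disk": 3600.0}


def drain_time_seconds(capacity_bytes, bandwidth_bytes_per_s):
    """Time to read an entire tier once at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


def closest_tier(retention_s):
    """Pick the tier whose drain time is nearest the retention window.

    Distance is measured on a log scale because the tiers are spaced
    by orders of magnitude.
    """
    return min(
        DRAIN_TIME_S,
        key=lambda tier: abs(math.log(DRAIN_TIME_S[tier] / retention_s)),
    )


# Assumed figures: 80 GB of HBM read at 3.35 TB/s.
hbm = drain_time_seconds(80e9, 3.35e12)
print(f"HBM drain time: {hbm * 1000:.0f} ms")

for window in (300, 3600):  # five minutes, one hour
    print(window, "seconds ->", closest_tier(window))
```

With these numbers a five-minute retention window lands closest to flash and a one-hour window closest to disk.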
API price sheets read cleanly through this ladder. A provider that charges one rate for fresh requests, a discounted rate for a five-minute cache, and a deeper discount for a one-hour cache is plausibly parking each cache lifetime on the storage whose drain time matches that retention window. On that reading, the five-minute tier is most likely flash and the one-hour tier is most likely spinning disk, though this is an inference from the pricing, not a published fact.
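To see what the discount means for a single request, here is a minimal cost sketch. The function name, the rates, and the token counts are all hypothetical. Real price sheets differ and may also charge for writing to the cache, which this sketch leaves out.

```python
def request_cost(prefix_tokens, new_tokens,
                 base_rate_per_mtok, cached_rate_per_mtok, cache_hit):
    """Input cost of one request, in the same currency unit as the rates.

    The shared prefix is billed at the cached rate on a hit and at the
    base rate on a miss. New tokens are always billed at the base rate.
    """
    prefix_rate = cached_rate_per_mtok if cache_hit else base_rate_per_mtok
    total = prefix_tokens * prefix_rate + new_tokens * base_rate_per_mtok
    return total / 1_000_000


# Hypothetical rates: 3.0 per million fresh tokens, 0.3 per million cached.
miss = request_cost(100_000, 500, 3.0, 0.3, cache_hit=False)
hit = request_cost(100_000, 500, 3.0, 0.3, cache_hit=True)
print(f"cache miss: {miss:.4f}  cache hit: {hit:.4f}")
```

With a 100,000-token prefix and a short new message, the hit costs roughly a tenth of the miss under these assumed rates, so a workflow that returns after the cache has expired pays full price for the whole prefix again.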
The point for product design: long context is not free, but it is also not all priced the same. Which tier a workflow's cached state can stay alive in, flash or disk, should shape how the application structures conversation, retrieval, and re-reads.
Source: Reiner Pope, blackboard inference economics on Dwarkesh Podcast, 2025
Retrieval is cheaper than stuffing
A good retrieval system finds the few pieces of context that matter. A bad one dumps everything into the window and asks the model to sort it out.
The better application design is often to use the model as a reasoner over selected evidence, not as a warehouse for every token the company owns.
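A toy version of "selected evidence" looks like the sketch below. It is deliberately crude: scoring is plain word overlap with the query and token cost is a whitespace word count, both stand-ins for the embedding search and real tokenizer a production system would use. The function name and sample chunks are hypothetical.

```python
def select_chunks(query, chunks, token_budget):
    """Return the highest-scoring chunks that fit within token_budget.

    Score = number of distinct query words that appear in the chunk.
    Cost  = whitespace word count, a rough stand-in for a token count.
    Chunks with no overlap are never included, even if budget remains.
    """
    query_terms = set(query.lower().split())

    def score(chunk):
        return len(query_terms & set(chunk.lower().split()))

    selected, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = len(chunk.split())
        if score(chunk) == 0 or used + cost > token_budget:
            continue
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "refund policy allows returns within 30 days",
    "office plants are watered on fridays",
    "refund requests need an order number",
]
print(select_chunks("refund policy days", chunks, token_budget=20))
print(select_chunks("refund policy days", chunks, token_budget=8))
```

The irrelevant chunk never reaches the prompt, and tightening the budget drops the weaker match first. Stuffing would have paid for all three chunks on every call.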
Long context changes product design
Agents need memory, but not all memory belongs in the prompt. Stable facts can live in databases. Recent events can live in summaries. High-stakes evidence can be retrieved on demand.
The product question becomes memory architecture: what must be read now, what can be cached, what can be summarized, and what should never enter the model at all.
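One way to make those four questions explicit is to write the placement decision down as code. The sketch below is a hypothetical policy, not a recommended one: the field names, the categories, and the order of the rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    PROMPT = "read now"
    CACHE = "read now from a cached prefix"
    SUMMARY = "kept as a summary"
    EXCLUDED = "never enters the model"


@dataclass
class MemoryItem:
    name: str
    needed_now: bool            # must the model read it this turn?
    reused_across_turns: bool   # does the same text recur turn after turn?
    sensitive: bool = False     # should it stay out of the model entirely?


def place(item):
    """Assign one memory item to a placement (hypothetical rule order)."""
    if item.sensitive:
        return Placement.EXCLUDED
    if item.needed_now and item.reused_across_turns:
        return Placement.CACHE
    if item.needed_now:
        return Placement.PROMPT
    return Placement.SUMMARY


items = [
    MemoryItem("tool instructions", needed_now=True, reused_across_turns=True),
    MemoryItem("latest user message", needed_now=True, reused_across_turns=False),
    MemoryItem("last week's tickets", needed_now=False, reused_across_turns=False),
    MemoryItem("payment card numbers", needed_now=True, reused_across_turns=True,
               sensitive=True),
]
for item in items:
    print(f"{item.name}: {place(item).value}")
```

The value of a table like this is less the specific rules than that every piece of state gets an explicit answer instead of defaulting into the prompt.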
The constraint pushes down into chips
Long context stresses both HBM bandwidth and HBM capacity. That ties model-level choices such as context length directly to chip economics.
The model designer, infra operator, and application builder are all negotiating the same scarce object: fast memory around the accelerator.