Models · 5 of 6

Why does context get expensive so fast?

Long context feels like memory. Under the hood it is a live working set that has to be read, routed, cached, and paid for.

Where the binding constraint sits today

Context becomes binding when the application wants more state than the model can read cheaply at low latency. The constraint is memory movement, not the marketing size of the window.

Context is not free recall

Putting more text into a prompt does not give the model magical memory. It gives the model more material to process at inference time.

The system has to tokenize it, route it through attention, store or recompute cache state, and pay the latency and memory cost.

The KV cache is the hidden bill

During generation, the model keeps key and value states so it can attend back to prior tokens without recomputing everything. That cache grows linearly with context length, number of layers, number of key/value heads, head dimension, and batch size.

At long context, the cache can dominate memory pressure. That is why serving long documents is a systems problem, not just a model feature.
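To make the bill concrete, here is a rough sizing sketch. It assumes a standard attention layout where each layer stores one key and one value vector per token per key/value head. The function name, the parameter names, and the example configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are illustrative assumptions, not the specs of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a standard attention layout.

    Hypothetical helper for back-of-envelope sizing. Ignores paging overhead,
    quantized caches, and any architecture-specific compression.
    """
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed example configuration (not a real model's published numbers).
per_token = kv_cache_bytes(seq_len=1, n_layers=32, n_kv_heads=8, head_dim=128)
full_window = kv_cache_bytes(seq_len=131_072, n_layers=32, n_kv_heads=8, head_dim=128)

print(per_token // 1024, "KiB per token")             # 128 KiB per token
print(full_window / 2**30, "GiB at 131,072 tokens")   # 16.0 GiB at 131,072 tokens
```

Under these assumptions a single 131,072-token request holds 16 GiB of cache before any weights are counted, and every additional concurrent request multiplies that figure.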

Retrieval is cheaper than stuffing

A good retrieval system finds the few pieces of context that matter. A bad one dumps everything into the window and asks the model to sort it out.

The better application design is often to use the model as a reasoner over selected evidence, not as a warehouse for every token the company owns.

Long context changes product design

Agents need memory, but not all memory belongs in the prompt. Stable facts can live in databases. Recent events can live in summaries. High-stakes evidence can be retrieved on demand.

The product question becomes memory architecture: what must be read now, what can be cached, what can be summarized, and what should never enter the model at all.
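One way to picture that architecture is a prompt assembler that pulls from different memory tiers under a fixed token budget. The sketch below is an assumption-laden illustration: `MemoryItem`, `build_prompt`, the priority scheme, and the one-token-per-word estimate are all hypothetical, and a real system would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    priority: int  # lower number = more important to include


def rough_tokens(text):
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def build_prompt(items, budget_tokens):
    """Greedily add items in priority order, skipping any that would exceed the budget."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = rough_tokens(item.text)
        if used + cost > budget_tokens:
            continue  # this item does not fit; it stays outside the model
        chosen.append(item.text)
        used += cost
    return "\n".join(chosen), used


items = [
    MemoryItem("user plan: enterprise", priority=0),                    # stable fact from a database
    MemoryItem("summary: user asked about billing twice", priority=1),  # summary of recent events
    MemoryItem("full chat log " * 50, priority=2),                      # raw history, 150 words
]
prompt, used = build_prompt(items, budget_tokens=20)
print(used)  # 9
```

With a 20-token budget the stable fact and the summary fit, and the raw log never enters the prompt. A greedy pass is only one possible policy; the useful part is that the budget is explicit.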

The constraint pushes down into chips

Long context stresses HBM bandwidth and capacity: the cache has to fit in fast memory, and it has to be read again for every generated token. That ties the choice of context length directly to chip economics.

The model designer, infra operator, and application builder are all negotiating the same scarce object: fast memory around the accelerator.
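A back-of-envelope bound shows why bandwidth is the scarce object. The sketch assumes that each decode step reads all model weights and the whole KV cache from HBM once, and it ignores compute time, batching, and any overlap or caching tricks. The function name and every number in the example are hypothetical round figures, not measurements of real hardware.

```python
def min_decode_seconds_per_token(weight_bytes, kv_bytes, hbm_bytes_per_sec):
    """Lower bound on per-token decode latency when memory reads dominate.

    Assumption: one full pass over weights plus KV cache per generated token.
    """
    return (weight_bytes + kv_bytes) / hbm_bytes_per_sec


# Hypothetical setup: 16 GB of weights, 2 TB/s of memory bandwidth.
weights = 16e9
bandwidth = 2e12

short_ctx = min_decode_seconds_per_token(weights, kv_bytes=0, hbm_bytes_per_sec=bandwidth)
long_ctx = min_decode_seconds_per_token(weights, kv_bytes=16 * 2**30, hbm_bytes_per_sec=bandwidth)

print(f"{short_ctx * 1e3:.1f} ms per token with an empty cache")
print(f"{long_ctx * 1e3:.1f} ms per token with a 16 GiB cache")
```

In this toy setup a 16 GiB cache roughly doubles the bytes moved per token, so the latency floor roughly doubles too, without a single extra parameter in the model.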