How inference actually happens: prefill vs decode

A model serving a request does two very different things. Prefill is compute-bound: it reads the whole prompt in one pass and builds the KV cache. Decode is memory-bandwidth-bound: it generates one token, streams all the weights from memory to do so, and repeats. That asymmetry shapes every chip and rack.

A request has two phases

When you send a prompt to a language model, the model first processes the prompt to build an internal state called the KV cache. This is the prefill phase. The model then generates output tokens one at a time; each new token depends on all the tokens before it, whose keys and values are read from the cache rather than recomputed. This is the decode phase.

Prefill and decode look similar from the outside (both are forward passes through the same network). They have completely different resource profiles. Treating them as the same workload is the most common reason inference systems are poorly sized.
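The two phases can be sketched as a control-flow skeleton. This is a toy, not a real model: `ToyModel`, its `forward` method, and the "next token" rule are hypothetical stand-ins, and the only things it tracks are how the cache grows and how many times the full weights would be streamed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for a transformer; tracks cache size and weight reads."""
    kv_cache: list = field(default_factory=list)  # one entry per token seen so far
    weight_reads: int = 0  # number of forward passes, i.e. full weight streams

    def forward(self, tokens):
        # One forward pass streams the weights once, however many tokens it carries.
        self.weight_reads += 1
        self.kv_cache.extend(tokens)
        # Made-up "next token": any deterministic function of the cache will do here.
        return sum(self.kv_cache) % 100


def generate(model, prompt, max_new_tokens):
    # Prefill: the whole prompt goes through in a single pass.
    next_token = model.forward(prompt)
    output = [next_token]
    # Decode: one pass per further token, each depending on the cache so far.
    for _ in range(max_new_tokens - 1):
        next_token = model.forward([next_token])
        output.append(next_token)
    return output


model = ToyModel()
tokens = generate(model, prompt=[1, 2, 3, 4], max_new_tokens=3)
print(tokens, model.weight_reads, len(model.kv_cache))
```

The four prompt tokens share one weight read, while every generated token after the first costs a weight read of its own. That difference in reuse is where the two resource profiles come from.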

Prefill is compute-heavy

During prefill, the model processes all input tokens in parallel. A 4,000-token prompt means a single forward pass over 4,000 tokens, so each weight matrix is loaded once and reused across every token in a large matrix multiplication that saturates the tensor cores. FLOPS are the binding resource. Memory bandwidth and capacity still matter, but the chip is mostly doing useful arithmetic.
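A rough roofline check makes the "FLOPS are the binding resource" claim concrete. The sketch below assumes the common approximation of about 2 FLOPs per parameter per token and counts only weight traffic; it ignores attention FLOPs, activations, and KV cache reads. The chip figures (1e15 FLOP/s, 3e12 bytes/s) are illustrative round numbers, not vendor specifications, and the function names are made up for this example.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs per byte of weights read: (2 * P * T) / (P * b) = 2 * T / b."""
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_flops, mem_bandwidth):
    """Intensity above which a chip is limited by compute rather than memory."""
    return peak_flops / mem_bandwidth


def is_compute_bound(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2.0):
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return intensity >= ridge_point(peak_flops, mem_bandwidth)


PEAK_FLOPS = 1e15     # assumed, illustrative
MEM_BANDWIDTH = 3e12  # assumed, illustrative (bytes per second)

print(is_compute_bound(4000, PEAK_FLOPS, MEM_BANDWIDTH))  # prefill: 4,000 tokens per pass
print(is_compute_bound(1, PEAK_FLOPS, MEM_BANDWIDTH))     # decode: 1 token per pass
```

Under these assumptions the ridge point sits at roughly 333 FLOPs per byte. A 4,000-token pass lands far above it, and a one-token pass lands far below it.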

This is why Cerebras (with massive on-chip SRAM and a wafer-scale tensor factory) and Groq's LPU (with deterministic execution and abundant on-chip memory) both compete most strongly on prefill-heavy workloads. Their architectural bets reward the phase that is compute-dense and bandwidth-light.

Decode is memory-heavy

During decode, the model generates one token per step. To generate that single token, every weight has to be fetched from HBM and multiplied against that one token's activations, and the KV cache has to be read for attention. The compute is small (one token's worth), but the memory traffic is enormous. The chip spends most of its time waiting on HBM bandwidth.

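This is the workload that drove NVIDIA to NVL72. A single H100 has 80 GB of HBM with about 3 TB/s of bandwidth. That is not enough for low-latency decode on a large model: the weights may not even fit, and every generated token has to wait for them to stream through. Gang 72 GPUs together over NVLink, as NVL72 does with Blackwell-generation parts, and the aggregate memory bandwidth becomes large enough to feed decode at the latency hyperscale services need. The rack is the chip.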

Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)
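The bandwidth argument in this section can be checked with back-of-envelope arithmetic. The sketch below assumes a hypothetical 70-billion-parameter model in 16-bit weights, batch size 1, perfect sharding across GPUs, and no interconnect or KV cache cost, so the result is a floor on per-token time rather than a prediction. The 80 GB and 3 TB/s per-GPU figures are the ones quoted above; the function names are made up for this example.

```python
def fits_in_memory(n_params, bytes_per_param, capacity_per_gpu, n_gpus=1):
    """True if the weights alone fit in the combined HBM capacity."""
    return n_params * bytes_per_param <= capacity_per_gpu * n_gpus


def decode_floor_seconds(n_params, bytes_per_param, bandwidth_per_gpu, n_gpus=1):
    """Lower bound on time per token: every weight byte is read once per step."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (bandwidth_per_gpu * n_gpus)


N_PARAMS = 70e9         # hypothetical model size
BYTES_PER_PARAM = 2     # 16-bit weights (assumption)
HBM_CAPACITY = 80e9     # bytes per GPU
HBM_BANDWIDTH = 3e12    # bytes per second per GPU

for n_gpus in (1, 72):
    fits = fits_in_memory(N_PARAMS, BYTES_PER_PARAM, HBM_CAPACITY, n_gpus)
    floor_ms = 1e3 * decode_floor_seconds(N_PARAMS, BYTES_PER_PARAM, HBM_BANDWIDTH, n_gpus)
    print(f"{n_gpus} GPU(s): fits={fits}, floor={floor_ms:.2f} ms/token")
```

With these numbers the weights (140 GB) do not fit on a single 80 GB device, and even if they did, streaming them at 3 TB/s would take about 47 ms per token. Spread across 72 devices, the same floor drops below 1 ms, before any communication overhead is counted.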

Disaggregated serving: prefill and decode on different hardware

Because the two phases have opposite resource profiles, the best deployment is often to run them on different hardware. Send prefill to a compute-dense chip with modest memory bandwidth. Send decode to a chip optimised for HBM bandwidth and inter-chip interconnect. Then ship the KV cache from the prefill hardware to the decode hardware.

NVL72 supports this natively: a fraction of the rack handles prefill, the rest handles decode, and the KV cache hops between them at NVLink speed. The same idea is why Cerebras + NVIDIA hybrid deployments are starting to appear: prefill on Cerebras, decode on NVL72.
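Whether disaggregation pays off depends partly on how large the KV cache is and how fast the link between the two pools is. A minimal sizing sketch follows, using the standard per-token formula (keys plus values, per layer, per KV head). The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache) and the 100 GB/s link are hypothetical values chosen for illustration, not figures from any specific model or product.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def transfer_seconds(n_bytes, link_bandwidth):
    """Time to move the cache across a link, ignoring latency and protocol overhead."""
    return n_bytes / link_bandwidth


# Hypothetical model shape and link speed, for illustration only.
cache = kv_cache_bytes(n_tokens=4000, n_layers=80, n_kv_heads=8, head_dim=128)
seconds = transfer_seconds(cache, link_bandwidth=100e9)
print(f"{cache / 1e9:.2f} GB, {seconds * 1e3:.1f} ms")
```

For a 4,000-token prompt this hypothetical shape gives roughly 1.3 GB of cache, which takes about 13 ms to move over the assumed link. The transfer is a one-off cost per request, so it matters most when generations are short.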

Why this matters for valuation

A buyer evaluating an inference offering has to know the workload mix. A coding agent might be 80% decode (long generations, short prompts). A research summariser might be 80% prefill (long documents, short answers). The cost per task differs widely between those two workloads on the same hardware.
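A small cost model shows how much the mix moves the answer. Everything numeric here is an assumption made up for illustration: the throughput figures (10,000 prompt tokens per second, 100 output tokens per second), the price ($3.60 per hour), and the token counts for each workload. It also assumes a single request at a time with no batching, which real serving systems do not do.

```python
def cost_per_task(prompt_tokens, output_tokens,
                  prefill_tok_per_s, decode_tok_per_s, dollars_per_hour):
    """Dollar cost of one request: time in each phase times the hourly price."""
    seconds = prompt_tokens / prefill_tok_per_s + output_tokens / decode_tok_per_s
    return seconds * dollars_per_hour / 3600.0


# Hypothetical hardware profile.
HW = dict(prefill_tok_per_s=10_000, decode_tok_per_s=100, dollars_per_hour=3.60)

# Hypothetical workloads: short prompt with long output, and the reverse.
coding_agent = cost_per_task(prompt_tokens=500, output_tokens=2_000, **HW)
summariser = cost_per_task(prompt_tokens=20_000, output_tokens=500, **HW)

print(f"coding agent: ${coding_agent:.5f} per task")
print(f"summariser:   ${summariser:.5f} per task")
```

The summariser handles about eight times as many tokens in total yet costs roughly a third as much per task in this sketch, because almost all of its tokens pass through the cheap phase. A single blended cost-per-token figure hides that difference.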

The chip vendor pitching "lowest cost per token" is implicitly assuming a workload mix. Asking "for which mix?" is the question that separates a real evaluation from a marketing benchmark.