Why does memory bandwidth matter more than FLOPS?
FLOPS measure how fast a chip can do math. But many AI workloads spend most of their time waiting for weights, activations, and cache entries to arrive from memory.
For inference, the binding constraint often shifts from compute to memory bandwidth. The chip with the biggest FLOPS number is not automatically the chip with the best token economics.
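A back-of-the-envelope comparison makes the gap concrete. The sketch below uses invented but plausible numbers (a hypothetical 1 PFLOP/s, 3 TB/s accelerator and a dense 70B-parameter FP16 model); none of them describe a specific chip.

```python
# Illustrative numbers only: a hypothetical accelerator and a dense model.
peak_flops = 1.0e15           # assumed 1 PFLOP/s of usable low-precision compute
hbm_bandwidth = 3.0e12        # assumed 3 TB/s of HBM bandwidth

params = 70e9                 # assumed dense 70B-parameter model
bytes_per_param = 2           # FP16/BF16 weights
flops_per_token = 2 * params  # ~2 FLOPs per parameter per generated token

compute_time = flops_per_token / peak_flops             # if math were the limit
memory_time = params * bytes_per_param / hbm_bandwidth  # just reading the weights

print(f"compute-bound:   {compute_time * 1e3:.2f} ms/token")   # ~0.14 ms
print(f"bandwidth-bound: {memory_time * 1e3:.2f} ms/token")    # ~47 ms
```

Under these assumptions, the weight read is roughly 300x slower than the math at batch size 1, which is why the peak FLOPS number says little about decode speed.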
The model has to be read before it can answer
Every generated token requires the accelerator to touch model weights, activations, and often a growing key-value (KV) cache. The math is fast once the bytes are in the right place. The wait is moving those bytes through the memory hierarchy.
This is the memory wall. It shows up whenever the chip has enough arithmetic capacity but cannot feed it quickly enough.
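One way to see the wall is to count the bytes a single decode step must pull through HBM. This is a minimal sketch under simplifying assumptions: batch size 1, every weight re-read per token, and an illustrative ~320 KB of KV cache per token of context (roughly an 80-layer, 8-KV-head, FP16 configuration).

```python
def decode_bytes_per_token(params, kv_bytes_per_token, context_len, weight_bytes=2):
    """Lower bound on bytes read from HBM per generated token (batch size 1).

    Simplified model: every step re-reads all weights plus the KV cache
    accumulated so far. Ignores activations, batching, and caching effects.
    """
    weights = params * weight_bytes
    kv_cache = kv_bytes_per_token * context_len
    return weights + kv_cache

# Hypothetical 70B dense FP16 model, ~320 KB of KV cache per token of context.
for ctx in (1_000, 32_000, 128_000):
    gb = decode_bytes_per_token(70e9, 320e3, ctx) / 1e9
    print(f"context {ctx:>7}: ~{gb:.0f} GB read per token")
```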
Prefill and decode stress different machines
Prefill processes the prompt in parallel. It produces large matrix multiplies that keep tensor cores busy. Decode generates one token at a time. It produces skinny matrix-vector operations wrapped around memory reads.
That is why a long-context chat can feel expensive even when the model is not doing much new reasoning. The system is repeatedly reading state to produce one more token.
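Arithmetic intensity, FLOPs per byte moved, captures the difference. The toy calculation below counts only weight traffic for a single hypothetical 8192x8192 layer; the shapes are assumptions chosen for illustration, not taken from any model.

```python
def arithmetic_intensity(tokens, d_in, d_out, dtype_bytes=2):
    """FLOPs per weight byte for one matrix applied to `tokens` rows.

    Prefill applies the matrix to many prompt tokens at once (fat GEMM);
    decode applies it to a single token (skinny GEMV). Simplification:
    only weight traffic is counted, since it dominates at small batches.
    """
    flops = 2 * tokens * d_in * d_out          # multiply-accumulate count
    weight_bytes = d_in * d_out * dtype_bytes  # the matrix itself
    return flops / weight_bytes

print(f"prefill, 4096-token prompt: {arithmetic_intensity(4096, 8192, 8192):.0f} FLOPs/byte")
print(f"decode,  1 token:           {arithmetic_intensity(1, 8192, 8192):.0f} FLOPs/byte")
```

With only weight traffic counted, intensity reduces to the token count: a 4096-token prefill does thousands of FLOPs per byte it reads, while decode does one, far below the machine balance of any modern accelerator.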
HBM turns into the price of a token
High-bandwidth memory is scarce, expensive, and physically close to the compute die. In inference, its bandwidth can set tokens per second and therefore cost per useful answer.
A useful rule: if the workload is decode-heavy, read the spec sheet's HBM bandwidth line before its peak FLOPS line. If the workload is prefill-heavy, FLOPS start to matter again.
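To apply the rule, compare two hypothetical chips. Their specs are invented for the example: chip A leads on FLOPS, chip B leads on bandwidth, and the 140 GB figure is the FP16 weight read from the earlier sketch.

```python
def decode_tps_upper_bound(hbm_bw_bytes, bytes_per_token):
    """Bandwidth-imposed ceiling on decode tokens/s at batch size 1."""
    return hbm_bw_bytes / bytes_per_token

# Two hypothetical chips: A has more FLOPS, B has more HBM bandwidth.
chips = {"A (2.0 PFLOPS, 2.0 TB/s)": 2.0e12,
         "B (1.2 PFLOPS, 4.0 TB/s)": 4.0e12}
bytes_per_token = 140e9  # the 70B FP16 weight read from the earlier sketch

for name, bw in chips.items():
    print(f"{name}: <= {decode_tps_upper_bound(bw, bytes_per_token):.1f} tokens/s")
# B wins decode despite the smaller FLOPS number: ~28.6 vs ~14.3 tokens/s.
```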
Software attacks the memory pattern
FlashAttention, paged attention, grouped-query attention, speculative decoding, quantization, and cache compression all share one motive: move fewer bytes, move them more predictably, or reuse them before they leave fast memory.
The algorithmic win and the hardware win compound. Better kernels make the same HBM go further. Better HBM lets the same model serve more tokens before the rack hits its economic limit.
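The byte savings are easy to quantify for the KV cache. The sketch below assumes a hypothetical 80-layer model with head dimension 128 and shows how grouped-query attention and a lower-precision cache shrink what decode must stream each step.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """KV cache growth per token: keys + values across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical 80-layer model, head dimension 128.
full_mha = kv_bytes_per_token(80, 64, 128, 2)   # 64 KV heads, FP16 cache
gqa      = kv_bytes_per_token(80,  8, 128, 2)   # grouped-query: 8 KV heads
gqa_int8 = kv_bytes_per_token(80,  8, 128, 1)   # plus an 8-bit cache

print(f"MHA FP16: {full_mha / 1e3:.0f} KB/token")
print(f"GQA FP16: {gqa / 1e3:.0f} KB/token")       # 8x fewer bytes to stream
print(f"GQA INT8: {gqa_int8 / 1e3:.0f} KB/token")  # 16x fewer
```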
The bottleneck rotates back into models
Once memory bandwidth becomes the constraint, model architecture changes. Smaller active parameter counts, mixture-of-experts routing, lower precision, and retrieval patterns are not just research ideas. They are ways to fit intelligence into the memory budget.
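A rough sketch of the mixture-of-experts trade, with invented numbers (640B total parameters, 40B active per token, 8-bit weights, 3 TB/s of HBM), shows why routing helps on a bandwidth-bound chip.

```python
# Hypothetical mixture-of-experts model: large total capacity, small
# per-token read because only a few experts activate.
total_params = 640e9    # all experts combined (assumed)
active_params = 40e9    # parameters actually touched per token (assumed)
hbm_bandwidth = 3.0e12  # assumed 3 TB/s
bytes_per_param = 1     # 8-bit weights

dense_read = total_params * bytes_per_param / hbm_bandwidth
sparse_read = active_params * bytes_per_param / hbm_bandwidth

print(f"dense read: {dense_read * 1e3:.0f} ms/token")    # ~213 ms
print(f"MoE read:   {sparse_read * 1e3:.1f} ms/token")   # ~13 ms
# Routing buys capacity without paying the full weight-read toll per token.
```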
The memory wall is where chips stop being a hardware topic and become a model-design topic.