Architecture · Generations

Chip generations shaped by bottlenecks: Hopper, Blackwell, Vera Rubin, Feynman

Each NVIDIA generation was designed to break the bottleneck the previous one exposed. Pre-training needed FLOPS. Decode needed memory bandwidth. Agents need tool-call latency. Swarms of agents need something we are about to find out.

Hopper: built for pretraining

The H100 (Hopper, 2022) was designed for the bottleneck pretraining had at the time: how to multiply matrices at FP16 and FP8 fast enough to consume trillions of tokens of training data. The architecture invested heavily in tensor cores, transformer-engine FP8 support, and NVLink for cross-GPU all-reduce during training.

It worked. Hopper became the workhorse of frontier pretraining for the model generations trained after its launch. The bottleneck during training was FLOPS, and FLOPS per dollar was what Hopper was built to deliver.
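To see why pretraining is FLOPS-bound, here is a rough sketch using the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule, the function name, and every number below are illustrative assumptions, not figures from the lecture or from any NVIDIA spec sheet.

```python
def training_days(params, tokens, num_gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock training time in days.

    Assumes total training compute ~= 6 * params * tokens (a widely used
    approximation for dense transformers) and a fixed sustained utilization.
    """
    total_flops = 6.0 * params * tokens
    sustained_flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / sustained_flops_per_second
    return seconds / 86_400  # seconds per day


# Hypothetical run: 70B parameters, 2T tokens, 1024 GPUs.
# The per-GPU peak (1e15 FLOP/s) and 40% utilization are placeholders.
days = training_days(70e9, 2e12, 1024, 1e15, 0.4)
print(f"estimated training time: {days:.1f} days")
```

Under these assumptions, doubling per-GPU FLOPS halves the training time, which is why a pretraining-era design spends its budget on tensor cores and lower-precision math.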

Grace Blackwell NVL72: built for decode

By the time Blackwell shipped (2024-2025), the binding workload had shifted. Inference was now larger than training in dollar terms, and inference is decode-heavy. Decode wants memory bandwidth, not FLOPS. A single Blackwell GPU has more HBM and more bandwidth than Hopper, but the real architectural move was at the rack level.
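The claim that decode wants bandwidth rather than FLOPS can be checked with a back-of-the-envelope model. This sketch assumes each decode step streams every weight from memory once, costs about 2 FLOPs per parameter per token, and ignores KV-cache reads. The function name and all hardware numbers are hypothetical placeholders, not Blackwell specifications.

```python
def decode_tokens_per_second(params, bytes_per_param, mem_bandwidth,
                             peak_flops, batch_size=1):
    """Return (tokens_per_second, limiting_resource) for steady-state decode."""
    weight_bytes = params * bytes_per_param
    # Memory bound: one pass over the weights yields one token per sequence in the batch.
    memory_bound = batch_size * mem_bandwidth / weight_bytes
    # Compute bound: roughly one multiply and one add per parameter per token.
    compute_bound = peak_flops / (2.0 * params)
    if memory_bound <= compute_bound:
        return memory_bound, "memory"
    return compute_bound, "compute"


# Hypothetical device: 3 TB/s of memory bandwidth, 1e15 FLOP/s peak,
# serving a 70B-parameter model stored at 1 byte per parameter.
for batch in (1, 256):
    rate, limit = decode_tokens_per_second(70e9, 1, 3e12, 1e15, batch_size=batch)
    print(f"batch={batch}: {rate:.0f} tokens/s, limited by {limit}")
```

At small batch sizes the memory bound is far below the compute bound, so extra FLOPS sit idle. Only large batches push the same hardware into the compute-bound regime.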

NVL72 gangs 72 Blackwell GPUs into a single coherent computer with about 130 TB/s of aggregate NVLink bandwidth. The rack-level memory pool can hold the entire KV cache for very large models without going over the network. Decode latency drops by roughly 50x versus the previous generation. The rack is the chip.

Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)
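A small sizing sketch shows why a rack-level memory pool matters for the KV cache. It uses the standard formula for a transformer KV cache (keys and values, per layer, per KV head, per token). The model shape and the per-GPU memory figure are hypothetical placeholders, not the configuration of any real model or product. Only the GPU count of 72 comes from the text above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_value):
    """Size of the KV cache: 2 tensors (K and V) per layer, per token, per sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


def fits_in_pool(kv_bytes, weight_bytes, num_gpus, hbm_bytes_per_gpu):
    """True if the weights plus the KV cache fit in the pooled GPU memory."""
    return kv_bytes + weight_bytes <= num_gpus * hbm_bytes_per_gpu


# Hypothetical model: 80 layers, 8 KV heads of dimension 128, 16-bit cache values,
# serving 64 concurrent sequences of 128k tokens each.
kv = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                    seq_len=128_000, batch_size=64, bytes_per_value=2)
print(f"KV cache: {kv / 1e12:.2f} TB")

# Placeholder memory size per GPU (192e9 bytes); 72 GPUs as in the rack described above.
print("fits on one GPU:", fits_in_pool(kv, 70e9, 1, 192e9))
print("fits in a 72-GPU pool:", fits_in_pool(kv, 70e9, 72, 192e9))
```

With these placeholder numbers the cache is more than ten times larger than a single GPU's memory but fits comfortably in the pooled memory of the rack, which is the situation the rack-level design targets.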

Vera Rubin: built for agents

Vera Rubin (2026, ramping into 2027) is the first generation designed around agentic workloads, and agents shift the bottleneck again. The GPU is no longer the limiting resource; the CPU is. An agent makes tool calls (run a function, query a database, fetch a page), the tools run on the CPU, and the whole GPU rack stalls until the CPU returns.
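The stall can be made concrete with a toy model of a single agent loop. The sketch assumes the loop is strictly serial (the GPU does nothing else while a tool call is in flight), which is a simplification: real serving stacks can overlap other requests. The class, function, and timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    gpu_seconds: float   # time the GPU spends generating tokens for this step
    tool_seconds: float  # time spent blocked on a CPU-side tool call


def gpu_utilization(steps, cpu_speedup=1.0):
    """Fraction of wall-clock time the GPU is busy in a strictly serial agent loop."""
    gpu_time = sum(step.gpu_seconds for step in steps)
    tool_time = sum(step.tool_seconds for step in steps) / cpu_speedup
    return gpu_time / (gpu_time + tool_time)


# Hypothetical trace: four steps, each 0.5 s of generation and 1.5 s of tool time.
trace = [AgentStep(gpu_seconds=0.5, tool_seconds=1.5) for _ in range(4)]
print(f"baseline GPU utilization: {gpu_utilization(trace):.0%}")
print(f"with 3x faster tool calls: {gpu_utilization(trace, cpu_speedup=3.0):.0%}")
```

In this toy trace a faster GPU barely helps, because most of the wall-clock time is tool time. Cutting tool-call latency is what raises utilization.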

NVIDIA's response is the Vera CPU: fewer cores than a typical cloud CPU, each optimised for single-threaded latency, and tightly coupled to the Rubin GPU memory hierarchy. Storage also connects directly into the fabric, so long-term agent memory does not need to bounce through the network. The whole rack is rebalanced around the CPU's job in an agent loop.

Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)

Feynman: built for swarms

Feynman is the generation after Vera Rubin, and Jensen Huang has been deliberately vague about it. The bet is that production AI by then is not single agents but swarms: thousands of agents with sub-agents, all coordinating, all generating tokens for each other rather than for a human end user.

The compute pattern is open. The early framing is that token traffic between agents will dwarf token traffic to humans, that scheduling will become a first-class architectural concern, and that the network — not the GPU or the CPU — will be the limiting resource. The shape of Feynman will tell you what NVIDIA believes the swarm workload looks like.
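As a purely illustrative piece of arithmetic, the sketch below models a swarm as a complete tree of agents and compares tokens exchanged between agents with tokens delivered to the human. The tree shape, the function name, and every number are assumptions made for the example. Nothing here describes a real Feynman workload.

```python
def swarm_traffic_ratio(fan_out, depth, tokens_per_link, tokens_to_human):
    """Ratio of agent-to-agent tokens to human-facing tokens in a tree-shaped swarm.

    Assumes a complete tree: each agent delegates to `fan_out` sub-agents for
    `depth` levels, and each parent/child link carries `tokens_per_link` tokens.
    """
    links = sum(fan_out ** level for level in range(1, depth + 1))
    inter_agent_tokens = links * tokens_per_link
    return inter_agent_tokens / tokens_to_human


# Hypothetical swarm: fan-out of 10, three levels deep, 2,000 tokens per link,
# and a 500-token final answer for the human.
ratio = swarm_traffic_ratio(fan_out=10, depth=3, tokens_per_link=2_000, tokens_to_human=500)
print(f"agent-to-agent traffic is {ratio:.0f}x the human-facing traffic")
```

Because the number of links grows geometrically with depth, even modest fan-out makes agent-to-agent traffic dominate, which is the intuition behind treating the network as the limiting resource.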

The pattern: bottleneck-led architecture

The pattern across four generations is the same. Identify the workload shape one generation ahead. Identify the bottleneck that shape will hit. Design the silicon, the rack, the network, and the software stack against that specific bottleneck before customers know it exists.

This is what codesign produces when extended to forecasting. It is also why pure-chip competitors keep losing ground: they are competing against a roadmap that is already designed for a workload that has not happened yet.

Multidomain vs general-purpose vs overfit

Each generation has to thread a tradeoff. Optimise the chip too tightly for one workload and you can be incredibly fast on that workload, but the addressable market for that exact workload may not be big enough to fund the next round of R&D. Optimise the chip to be good at everything and you end up good at nothing, indistinguishable from the previous generation.

Jensen Huang called the balance artistry at Stanford CS153: "if you build something too overfit for something, you could be incredibly good at it, but the market may not be big enough to fund a sufficiently large R&D. On the other hand, if you're good at everything, then you're good at nothing." This is the implicit tradeoff that shapes whether the next generation looks like a tighter ASIC or a broader accelerator.

It is also the reason ASIC-shaped competitors (Groq, Cerebras, Etched) tend to win on one workload shape and lose on adjacent ones, while NVIDIA stays uncomfortably broad on purpose. The breadth is the strategy.

Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)
