Metrics that matter: FLOPS, MFU, tokens-per-watt, tokens-per-dollar
Pick the wrong metric and you optimise the wrong thing. The metric a frontier lab cares about is not the metric a chip vendor publishes. The metric that predicts revenue is not the metric that predicts unit economics. This is the vocabulary.
At any given moment, one of four resources is bottlenecking the system: FLOPS, memory bandwidth, memory capacity, or network capacity. The right metric is the one that exposes which.
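One way to make the four-resource framing concrete is to compare demand against capacity for each resource and report the one closest to saturation. The sketch below is illustrative only: the `Resource` type, the `bottleneck` helper, and every number in `snapshot` are hypothetical and do not describe any real system.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    demand: float    # what the workload needs (per second, or in total for capacity)
    capacity: float  # what the hardware provides, in the same units as demand


def bottleneck(resources):
    """Return (name, utilisation) of the most heavily utilised resource."""
    worst = max(resources, key=lambda r: r.demand / r.capacity)
    return worst.name, worst.demand / worst.capacity


# Hypothetical snapshot of a decode-heavy serving workload (made-up numbers).
snapshot = [
    Resource("FLOPS", demand=0.2e15, capacity=2.0e15),            # FLOP/s
    Resource("memory bandwidth", demand=3.6e12, capacity=4.0e12),  # bytes/s
    Resource("memory capacity", demand=120e9, capacity=192e9),     # bytes
    Resource("network", demand=0.3e12, capacity=0.9e12),           # bytes/s
]

name, util = bottleneck(snapshot)
print(f"bottleneck: {name} at {util:.0%} utilisation")
```

In this made-up snapshot FLOPS sits at 10% utilisation while memory bandwidth is at 90%, so a FLOPS-centric metric would say little about what is limiting the workload.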
FLOPS: the headline number that often misleads
Floating-point operations per second (FLOPS) is the most quoted accelerator metric. A 2026-era frontier GPU advertises petaFLOPS-scale throughput at low precision (FP8, FP4) and considerably less at higher precision (FP32, FP64). The number is real, but it rarely predicts how fast your workload will actually run.
Why: most AI workloads in 2026 are not FLOPS-bound. They are memory-bandwidth-bound (during decode), memory-capacity-bound (when long-context KV caches have to fit in accelerator memory), or network-bound (during distributed training and inference). FLOPS is necessary, not sufficient.
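A simple roofline-style estimate shows why decode tends to be bandwidth-bound even on a chip with plenty of FLOPS. The sketch below compares only compute time against memory-traffic time and ignores memory capacity, the network, KV-cache reads, and activations. The model size, peak FLOPS, and bandwidth are assumed round numbers rather than the specification of any real accelerator, and the "about 2 FLOPs per parameter per token" figure is the usual forward-pass approximation.

```python
def roofline_time_s(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on step time: whichever of compute or memory traffic is slower."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / peak_bw
    bound = "compute" if compute_s >= memory_s else "memory-bandwidth"
    return max(compute_s, memory_s), bound


# Assumed, illustrative values (not a real chip or model).
PARAMS = 70e9          # model parameters
BYTES_PER_PARAM = 1    # FP8 weights
PEAK_FLOPS = 2e15      # FLOP/s
PEAK_BW = 4e12         # bytes/s


def step(tokens_in_flight):
    """One forward pass that processes `tokens_in_flight` tokens at once."""
    flops = 2 * PARAMS * tokens_in_flight    # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weights are read once per pass
    return roofline_time_s(flops, bytes_moved, PEAK_FLOPS, PEAK_BW)


decode_time, decode_bound = step(1)        # one new token for one sequence
prefill_time, prefill_bound = step(4096)   # a 4096-token prompt in one pass

print(f"decode:  {decode_time * 1e3:.1f} ms, {decode_bound}-bound")
print(f"prefill: {prefill_time * 1e3:.1f} ms, {prefill_bound}-bound")
```

Under these assumptions a single-token decode step spends its time streaming weights, so doubling peak FLOPS would not change the step time, whereas batching more tokens into each pass raises arithmetic intensity and moves the workload toward the compute roof.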
MFU and why Jensen says it is misleading
Model FLOPS Utilisation (MFU) is the fraction of the hardware's advertised peak FLOPS that the model's useful computation actually achieves. Higher MFU sounds better. In May 2026, an internal xAI memo reported the Memphis cluster running at 11% MFU, which was widely interpreted as wasted capacity.
Jensen Huang argued the opposite at Stanford CS153. He wants low MFU, because low MFU means compute is overprovisioned relative to the work, and that headroom is what softens Amdahl's law when a different resource (memory, network) spikes to 100%. A cluster running at 100% MFU has no slack and will stall whenever the bottleneck shifts.
The right read: MFU on its own is the wrong KPI. The correct frame is to look at all four dimensions (FLOPS, memory bandwidth, memory capacity, network) and ask which one is constraining the workload right now.
Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)
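For reference, MFU is usually computed from observed throughput rather than read off a hardware counter. The sketch below uses the common approximation of about 6 FLOPs per parameter per token for training (forward plus backward) and about 2 for inference (forward only), which ignores attention FLOPs. The cluster size, per-chip peak, and throughput are hypothetical and are not taken from the memo or the lecture.

```python
def mfu(tokens_per_s, n_params, peak_flops, flops_per_param_per_token=6.0):
    """Model FLOPS Utilisation: useful model FLOP/s divided by peak hardware FLOP/s.

    flops_per_param_per_token: ~6 for training, ~2 for inference (approximation).
    """
    achieved_flops = tokens_per_s * n_params * flops_per_param_per_token
    return achieved_flops / peak_flops


# Hypothetical training run: 1,000 accelerators at an assumed 1 PFLOP/s peak each.
cluster_peak = 1000 * 1e15
training_mfu = mfu(tokens_per_s=5.0e5, n_params=70e9, peak_flops=cluster_peak)

print(f"training MFU: {training_mfu:.0%}")
```

Because the denominator is the advertised peak at a chosen precision, the same run reports a different MFU depending on which peak figure is used, which is one more reason to read the number alongside the other three resources.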
Tokens-per-watt: the economic metric for inference
For inference at scale, the metric that predicts unit economics is tokens generated per unit of energy: tokens per second per watt of sustained power, which is the same thing as tokens per joule (or, scaled by 3,600, tokens per watt-hour). It bundles silicon efficiency, memory bandwidth, network topology, and rack-level engineering into a single number that the buyer actually cares about.
The same chip can deliver wildly different tokens-per-watt depending on rack architecture. An NVL72 rack delivers far more tokens-per-watt than the same 72 GPUs would if left unconnected, because decode bandwidth is the bottleneck and the rack's 72-GPU NVLink domain lets it pool memory across the rack. The metric exposes the rack design as much as it exposes the silicon.
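The unit arithmetic is worth spelling out, since a watt is a rate and tokens are a count. The sketch below treats tokens-per-watt as sustained tokens per second divided by sustained power draw. The two rack configurations are hypothetical, and their throughput and power figures are invented for illustration rather than measured on any real hardware.

```python
def tokens_per_joule(tokens_per_s, power_w):
    """Tokens per second per watt; numerically identical to tokens per joule."""
    return tokens_per_s / power_w


def tokens_per_wh(tokens_per_s, power_w):
    """The same quantity expressed per watt-hour (1 Wh = 3600 J)."""
    return tokens_per_joule(tokens_per_s, power_w) * 3600.0


# Hypothetical: identical accelerators in two rack designs (made-up figures).
isolated = tokens_per_joule(tokens_per_s=40_000, power_w=100_000)
pooled = tokens_per_joule(tokens_per_s=150_000, power_w=120_000)

print(f"isolated servers: {isolated:.2f} tokens/J")
print(f"pooled rack:      {pooled:.2f} tokens/J")
print(f"ratio:            {pooled / isolated:.2f}x")
```

In this invented example the pooled rack draws more power yet comes out ahead, because throughput rises faster than power once the interconnect relieves the decode bottleneck.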
Tokens-per-dollar: the economic metric for buyers
Tokens-per-dollar is what an end customer ultimately pays for. It blends tokens-per-watt with electricity cost, capex amortisation, and operating overhead. The same model can have a 5x range in tokens-per-dollar depending on which cloud, which region, and which deployment shape (batched vs interactive, prefill-heavy vs decode-heavy) you choose.
For a frontier lab competing on API price, tokens-per-dollar is the metric that defines whether the unit economics work. For a hyperscaler selling capacity to that lab, it is the metric that defines whether the deal works for them.
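A rough cost model shows how these ingredients combine. The sketch below is a simplification under stated assumptions: straight-line capex amortisation, a flat electricity price, a single hourly overhead figure, and a `utilisation` factor standing in for deployment shape. The function name, its parameters, and all of the prices are hypothetical.

```python
HOURS_PER_YEAR = 8760


def tokens_per_dollar(tokens_per_s, power_kw, price_per_kwh, capex,
                      amortisation_years, opex_per_hour, utilisation=1.0):
    """Tokens served per dollar of fully loaded hourly cost (simplified model)."""
    energy_per_hour = power_kw * price_per_kwh
    capex_per_hour = capex / (amortisation_years * HOURS_PER_YEAR)
    cost_per_hour = energy_per_hour + capex_per_hour + opex_per_hour
    tokens_per_hour = tokens_per_s * 3600 * utilisation
    return tokens_per_hour / cost_per_hour


# Hypothetical rack; every figure here is an assumption for illustration.
rack = dict(tokens_per_s=150_000, power_kw=120, price_per_kwh=0.08,
            capex=3_504_000, amortisation_years=4, opex_per_hour=10.4)

batched = tokens_per_dollar(**rack, utilisation=1.0)
interactive = tokens_per_dollar(**rack, utilisation=0.5)

print(f"batched:     {batched:,.0f} tokens per dollar")
print(f"interactive: {interactive:,.0f} tokens per dollar")
```

With these made-up inputs, amortised capex dominates the hourly cost and electricity is a small share, so utilisation and deployment shape move tokens-per-dollar far more than the power price does.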
Intelligence-side metrics: evals, pass@k, real tasks
On the model side, the equivalent failure mode is over-optimising for benchmark scores. MMLU and GPQA can be saturated. SWE-bench can be gamed by harness engineering. The metrics worth tracking are evals tied to real tasks: agent task completion rates, end-to-end workflow success, and time-on-task. Jensen put it simply at CS153: you have to design real evals, because otherwise teams optimise the number rather than the capability.
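For pass@k, named in the heading above, a commonly used definition is the probability that at least one of k sampled attempts passes. It is typically estimated from n samples per task, of which c passed, as 1 - C(n-c, k) / C(n, k), the unbiased estimator popularised by the HumanEval code benchmark. The sketch below implements that formula; the sample counts in the example are invented.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k attempts drawn without replacement from the n samples all fail.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k draws include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented example: 10 attempts at one task, 3 of them passed.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)

print(f"pass@1 = {p1:.3f}, pass@5 = {p5:.3f}")
```

The gap between pass@1 and pass@5 in this example is a reminder that a high pass@k at large k can coexist with unreliable single-attempt behaviour, which is the number that matters for most real tasks.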
Pere takes the same view across all four tracks: name the bottleneck, pick the metric that exposes it, treat headline numbers (FLOPS, MFU, eval scores) as starting points rather than answers.
Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)