What happens when a cluster mixes chip generations?
Heterogeneous training is harder than the GPU count suggests. Inference is mostly fine. The split is why one of the largest AI clusters ever built was leased to a competitor for inference while its owner moved training elsewhere.
When a synchronized training run mixes generations, the slowest chip sets the cadence. The same asset, repurposed for inference, looks completely different. Workload type decides whether mixed silicon is a tax or a feature.
Synchronized training waits for the slowest chip
Distributed training advances in lockstep. Every chip in the run, roughly one hundred thousand at frontier scale, must finish one step before any can start the next. The fastest chips wait for the slowest. This is the straggler effect, and it is unavoidable in synchronous gradient-descent training.
Mix three generations of GPU in a single ring (H100, H200, and GB200) and the older silicon sets the cadence for the new. The Blackwell math the operator paid for is gated by the Hopper math the operator already owns.
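The lockstep penalty can be sketched with a toy model. It assumes every chip receives an equal shard of each step, so the step lasts as long as the slowest chip needs. The fleet sizes and relative speeds below are hypothetical ratios chosen for illustration, not benchmarks.

```python
def lockstep_efficiency(fleet):
    """Fraction of aggregate throughput a lockstep run delivers.

    fleet maps a chip name to (count, relative_throughput).
    Assumes equal work per chip per step (no proportional sharding).
    """
    total_chips = sum(count for count, _ in fleet.values())
    slowest = min(speed for _, speed in fleet.values())
    # Every chip advances at the pace of the slowest one.
    delivered = total_chips * slowest
    # What the fleet could deliver if no chip ever waited.
    available = sum(count * speed for count, speed in fleet.values())
    return delivered / available


# Hypothetical fleet: counts and speed ratios are illustrative assumptions.
mixed = {"H100": (6, 1.0), "H200": (2, 1.25), "GB200": (2, 2.5)}
uniform = {"GB200": (10, 2.5)}

print(f"mixed fleet:   {lockstep_efficiency(mixed):.2f}")
print(f"uniform fleet: {lockstep_efficiency(uniform):.2f}")
```

Under these assumptions the mixed fleet delivers about 74 percent of the throughput it nominally contains, while the uniform fleet delivers all of it. Sizing each chip's shard in proportion to its speed can in principle narrow the gap; the sketch does not model that.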
xAI's reported model FLOPs utilization (MFU), the share of theoretical peak compute actually used, sits near 11 percent (The Information, 2026). Peer labs running homogeneous fleets are reported at roughly 35 to 45 percent, with figures characterized at 43 percent for Meta and 46 percent for Google.
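For readers who want the utilization figure as arithmetic, here is a minimal sketch. It uses the common approximation of about 6 FLOPs per parameter per token for dense transformer training; the model size, token rate, chip count, and per-chip peak are hypothetical placeholders, not figures for any lab named here.

```python
def mfu(tokens_per_second, n_params, n_chips, peak_flops_per_chip):
    """Share of theoretical peak compute actually used by a training run."""
    # Approximation: ~6 FLOPs per parameter per token (forward plus backward)
    # for a dense transformer. Real accounting varies by architecture.
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = n_chips * peak_flops_per_chip
    return achieved_flops / peak_flops


# All inputs are illustrative assumptions.
example = mfu(
    tokens_per_second=2.0e6,
    n_params=7.0e10,
    n_chips=2000,
    peak_flops_per_chip=1.0e15,  # round placeholder, not a datasheet value
)
print(f"MFU: {example:.2f}")
```

With these placeholder inputs the run lands at 42 percent. On a mixed fleet the denominator would sum the peak of each generation separately.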
NCCL ring topology stops scaling around one hundred thousand chips
NVIDIA Collective Communications Library (NCCL) handles the all-reduce step that synchronizes gradients across GPUs. One of its core algorithms is a ring: each chip talks to its neighbor, and gradient chunks circulate around the loop on every step.
Ring works at the one thousand to ten thousand GPU scale. At one hundred thousand, the round-trip latency through the ring becomes a meaningful share of step time, and chips spend a growing fraction of every cycle waiting for data instead of doing math. Google sidestepped this constraint with custom optical-circuit-switched topologies. Most of the rest of the industry has not yet made the equivalent move.
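The scaling problem can be sketched with the standard latency-plus-bandwidth cost model for a flat ring all-reduce. The message size, link bandwidth, and per-hop latency below are assumed values for illustration. The model ignores hierarchy, tree algorithms, and overlap with compute, so it is a rough picture and not a description of how NCCL schedules a real job.

```python
def ring_allreduce_terms(n_chips, message_bytes, link_bytes_per_s, hop_latency_s):
    """Return (latency_seconds, bandwidth_seconds) for a flat ring all-reduce."""
    # Reduce-scatter plus all-gather: 2 * (N - 1) neighbor-to-neighbor hops.
    steps = 2 * (n_chips - 1)
    latency_s = steps * hop_latency_s
    # Each hop moves one chunk of size message_bytes / N.
    bandwidth_s = steps * (message_bytes / n_chips) / link_bytes_per_s
    return latency_s, bandwidth_s


# Assumed parameters: 10 GB of gradients, 50 GB/s links, 2 microseconds per hop.
ASSUMED = dict(message_bytes=1e10, link_bytes_per_s=5e10, hop_latency_s=2e-6)


def latency_share(n_chips):
    """Fraction of all-reduce time spent on per-hop latency."""
    latency_s, bandwidth_s = ring_allreduce_terms(n_chips, **ASSUMED)
    return latency_s / (latency_s + bandwidth_s)


for n in (1_000, 10_000, 100_000):
    print(n, round(latency_share(n), 3))
```

The bandwidth term stays nearly flat as the ring grows, while the latency term grows linearly with chip count. Under these assumed parameters, latency is roughly 1 percent of all-reduce time at one thousand chips, 9 percent at ten thousand, and 50 percent at one hundred thousand.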
Blackwell power dynamics break stacks tuned for Hopper
GB200 chips draw power so aggressively at peak that the silicon now ships with hardware-level power smoothing: programmable ramp rates, minimum power floors, and stop-delay timers. The features exist because bulk-synchronous AI workloads cause large simultaneous power swings, reportedly as much as 30 percent of peak grid demand, that can stress utility transformers and uninterruptible power supplies up the chain.
A software stack tuned for H100 does not understand these features. A training run that worked beautifully on Hopper can impose load patterns on GB200 that the hardware is now actively trying to mitigate. Until the stack is rewritten, the new chip runs below its envelope, and the cluster operator pays for capacity it cannot fully use.
This is the engineering story behind the financial story. Rewriting a frontier training stack for new silicon is a multi-quarter effort, and during that effort the new hardware does not behave the way the launch slides suggested it would.
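To make the three smoothing features concrete, here is a toy simulation. The semantics are a simplification assumed for illustration, not NVIDIA's specification: the ramp rate caps the change per tick, the floor sets a minimum draw, and the stop delay holds the current level for a few ticks before any ramp-down. All names and numbers are hypothetical.

```python
def smooth_power(requested, max_ramp, floor, stop_delay):
    """Return the power drawn per tick under a toy smoothing policy."""
    drawn = []
    current = floor
    hold = 0  # ticks left before a ramp-down is allowed
    for want in requested:
        target = max(want, floor)  # minimum power floor
        if target >= current:
            # Ramp up no faster than max_ramp per tick.
            current = min(target, current + max_ramp)
            hold = stop_delay  # re-arm the stop-delay timer
        elif hold > 0:
            hold -= 1  # hold the current level before dropping
        else:
            # Ramp down no faster than max_ramp per tick.
            current = max(target, current - max_ramp)
        drawn.append(current)
    return drawn


def largest_swing(trace):
    """Largest tick-to-tick change in a power trace."""
    return max(abs(b - a) for a, b in zip(trace, trace[1:]))


# A bulk-synchronous pattern: full power, sudden idle, full power again.
raw = [1000, 1000, 1000, 0, 0, 0, 0, 0, 1000]
smoothed = smooth_power(raw, max_ramp=400, floor=200, stop_delay=2)
print(smoothed)
print(largest_swing(raw), largest_swing(smoothed))
```

In this toy trace the raw workload swings by the full 1000 units between ticks, while the smoothed trace never moves more than 400. The cost is visible too: the chip keeps drawing power while the workload is idle, which is one reason a stack unaware of these settings can leave the hardware running below its envelope.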
Inference parallelizes more forgivingly
The picture flips for inference. Each request can be routed to whichever chip is free. There is no global step boundary. Mixed-generation hardware becomes a heterogeneous fleet that schedules cleanly across a single inference workload, with the H100s and the GB200s each doing what they are individually good at.
When a single tenant occupies all the hardware, multi-tenant network jitter disappears too. The cluster's two structural weaknesses for training (synchronization and tenancy) both stop mattering. The same physical asset can produce 11 percent training MFU and competitive inference throughput in the same quarter.
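The contrast with lockstep training can be sketched with a greedy dispatcher that sends each request to whichever chip frees up first. The chip speeds and request sizes are hypothetical, and real serving stacks add batching, queues, and model placement that this sketch leaves out.

```python
import heapq


def dispatch(n_requests, work_per_request, chip_speeds):
    """Route each request to the chip that becomes free earliest.

    Returns (makespan, requests_served_per_chip).
    """
    # Min-heap of (time the chip becomes free, chip index).
    free_at = [(0.0, i) for i in range(len(chip_speeds))]
    heapq.heapify(free_at)
    served = [0] * len(chip_speeds)
    makespan = 0.0
    for _ in range(n_requests):
        start, chip = heapq.heappop(free_at)
        done = start + work_per_request / chip_speeds[chip]
        served[chip] += 1
        makespan = max(makespan, done)
        heapq.heappush(free_at, (done, chip))
    return makespan, served


# Hypothetical two-chip fleet: one slow chip, one chip twice as fast.
makespan, served = dispatch(n_requests=6, work_per_request=1.0,
                            chip_speeds=[1.0, 2.0])
print(makespan, served)
```

In this example the faster chip serves four requests and the slower chip two, and all six finish in 2.0 time units. An equal three-and-three split, the lockstep analogue, would take 3.0 because the slower chip would need that long to finish its share.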
The asset-rotation playbook
The largest practical example of this distinction in 2026 is xAI's Colossus 1 in Memphis. The cluster mixes an estimated 150,000 H100s, 50,000 H200s, and 20,000 GB200s, totaling roughly 220,000 GPUs and 300 megawatts of capacity (Tom's Hardware, Data Center Dynamics).
In the spring of 2026, xAI moved its training workloads onto Colossus 2, a homogeneous Blackwell build, and leased Colossus 1 in its entirety to Anthropic for inference. Deal-revenue projections sit in the three to six billion dollar annualized range across analyst estimates. The narrative shift this enables, especially in a public-listing window, is "infrastructure tollgate" in place of "research-lab burn rate."
The strategic move underneath the financial one is recasting the cluster from a struggling training asset into a single-tenant inference asset that prints cash, without changing a single transistor.
The read for operators and investors
For lab operators, the lesson is to keep training fleets homogeneous and to isolate generation transitions on purpose. Mixing closely matched parts (H200 plus H100, or B200 plus H200) is recoverable when the performance gap is small. The H100-to-GB200 jump is large enough that mixing during a synchronized training run is a measurable tax, not an accounting rounding error.
For infrastructure investors, the lesson is that an underutilized training cluster is a much more valuable inference asset than the headline utilization figure suggests. The chip generation that struggles in synchronized training may parallelize fine for inference, and inference revenue can carry an under-yielding training asset across a bad quarter or a public-listing window.