When the Workload Stops Moving: Why Custom Silicon Finally Pays

Ask why every hyperscaler is suddenly designing its own AI chip and you get the same three answers, all correct. Nvidia's data-center gross margins run in the mid-70s percent, so building your own strips out a tax measured in billions. You trade a dependency on Nvidia for a dependency on TSMC — but a foundry that sells capacity at roughly 55% gross margin is a gentler landlord than a vendor capturing the full stack. And a chip tuned to one workload can beat a general-purpose GPU on performance per dollar, because you stop paying for silicon you never use.

All true. All beside the point.

Because every one of those facts was also true three years ago, and five years ago. Nvidia's margins were fat. TSMC was a foundry. Inference was specializable. If the case for custom silicon were really about margin, the rush would have started the moment Nvidia's pricing power became obvious — not in the specific window of 2023 to 2026, when Microsoft, Meta, Amazon, and OpenAI all stood up serious chip programs within roughly two years of one another.

So the interesting question isn't why custom silicon. It's why now, and why all at once. And the answer is the one objection the consensus never addresses.

The objection that's actually correct

The strongest pushback on custom silicon is the boring one: economies of scale. Nvidia designs one chip and sells it to everyone. The design cost — easily $100M to $500M for a leading-edge part — amortizes across millions of units. Yields mature because the same silicon runs everywhere. Nvidia ships a new generation roughly every year, and the whole industry's volume funds that cadence.

A hyperscaler going custom gives all of that up. Smaller runs. Worse early yields. A design team that has to re-earn its keep every generation. On paper, going custom means fighting the economics of scale, not exploiting them.
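To put rough numbers on the scale objection, here is a minimal sketch of how a one-time design cost spreads across a production run. The $100M to $500M range comes from the paragraph above; the run sizes are hypothetical, picked only to show the shape of the curve, and the helper name is made up for this illustration.

```python
def design_cost_per_chip(design_cost_usd: float, units: int) -> float:
    """Spread a one-time design cost evenly over every chip in the run."""
    if units <= 0:
        raise ValueError("units must be positive")
    return design_cost_usd / units


# Design-cost range quoted in the text for a leading-edge part.
DESIGN_COSTS_USD = (100e6, 500e6)

# Hypothetical run sizes (assumptions, not figures from the article).
RUN_SIZES = (50_000, 500_000, 5_000_000)

for units in RUN_SIZES:
    low, high = (design_cost_per_chip(cost, units) for cost in DESIGN_COSTS_USD)
    print(f"{units:>9,} units: ${low:>6,.0f} to ${high:>6,.0f} of design cost per chip")
```

Under these assumptions the design cost carried by each chip falls a hundredfold as the run grows from tens of thousands of units to millions, which is the whole of the merchant vendor's amortization advantage in one line of arithmetic. It ignores yield, mask sets, and software, all of which the objection also counts.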

This objection wasn't just valid. For most of the last decade, it was fatal.

Why it used to kill you

A chip takes roughly two years from architecture to deployment. For most of deep learning's history, two years was an eternity. The dominant architecture kept moving underneath you: convolutional nets to early attention to transformers, with a churn of variants inside each. Design a chip around the workload of 2018 and by 2020 you'd have taped out an expensive accelerator for a shape of computation the field had already left behind. Half a billion dollars to be precisely optimized for the wrong thing.

That's the real reason custom silicon stayed a Google-only sport for years. Google could justify the first TPU in 2015 not because it was smarter, but because it had a stable, enormous internal workload — ranking for search and ads — worth building fixed hardware around. Almost nobody else had a workload both big enough and stable enough to survive the design cycle.

The diseconomies of scale were real. What changed isn't that they vanished. It's that something finally outweighed them.

What changed: the target stood still

The transformer won. Not just for language — for vision, for code, for audio, for the agentic systems everyone is now racing to ship. The industry converged on a single architecture and then, crucially, stopped. The model of 2026 is recognizably the same shape of computation as the model of 2023: attention, large matrix multiplies, the same memory-bandwidth-bound decode profile. The details move constantly; the silicon-relevant structure barely does.

The moment the workload stood still, the two-year design cycle stopped being a liability. Now you tape out for transformer inference and, two years later, you're still serving transformer inference. The chip arrives optimized for the thing you're actually running. The bet that used to be suicidal became merely expensive — and against a multi-billion-dollar annual GPU bill, merely expensive is a rounding error.

This is also why it happened all at once. Architectural convergence is an industry-wide event, not a company-specific one. The same stabilization that made a custom chip safe for Microsoft made it safe for Meta and Amazon and OpenAI in the same window. They didn't independently rediscover Nvidia's margins — those were never hidden. They crossed the same threshold together: a workload finally stable enough to build fixed hardware around.

And the scale objection gets quietly answered by the same convergence. Hyperscaler inference volume is no longer "small." When one company's internal demand runs into the millions of chips a year, the run-size penalty shrinks — you become your own economy of scale. Specialization covers the rest: every transistor spent on transformer matmul and memory bandwidth, none wasted on the graphics and HPC generality a merchant GPU still has to carry.

Cheaper to buy isn't cheaper to run

Here's the trap in the margin story, and it's where most explanations quietly cheat. A custom chip can be far cheaper to own and still lose on the only number that matters — cost per token — because cost per token is throughput-adjusted. Picture two machines on a factory floor. Nvidia's stamps 30 widgets an hour. Your custom machine costs less, but stamps 25. Unless the price gap more than covers the output you give up, your cheaper machine makes more expensive widgets.
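The factory-floor arithmetic can be sketched in a few lines. The 30 and 25 widgets per hour are the figures from the paragraph above; the hourly cost of 100 and the two discounts are hypothetical placeholders, since only the ratio matters.

```python
def cost_per_unit(cost_per_hour: float, units_per_hour: float) -> float:
    """Throughput-adjusted cost: what one widget (or token) actually costs."""
    return cost_per_hour / units_per_hour


def breakeven_price_ratio(slower_rate: float, faster_rate: float) -> float:
    """Largest fraction of the faster machine's hourly cost the slower one can
    carry and still tie on cost per unit."""
    return slower_rate / faster_rate


NVIDIA_RATE = 30       # widgets per hour, from the text
CUSTOM_RATE = 25       # widgets per hour, from the text
NVIDIA_COST = 100.0    # hypothetical hourly cost, arbitrary units

ratio = breakeven_price_ratio(CUSTOM_RATE, NVIDIA_RATE)
print(f"custom must cost under {ratio:.1%} of Nvidia's hourly cost to win")

baseline = cost_per_unit(NVIDIA_COST, NVIDIA_RATE)
for discount in (0.10, 0.25):  # hypothetical price gaps
    custom = cost_per_unit(NVIDIA_COST * (1 - discount), CUSTOM_RATE)
    verdict = "cheaper" if custom < baseline else "more expensive"
    print(f"{discount:.0%} cheaper machine -> {verdict} widgets")
```

With these numbers the break-even sits at five-sixths of the incumbent's cost: a 10% discount still yields pricier widgets, and a 25% discount clears the bar.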

So the right question is never "is the chip cheaper" but "is the token cheaper" — and the answer splits sharply by who's asking.

Google's TPU clears the bar, but look closely at how it clears it, because the mechanism is not the obvious one. A single Trillium-class (sixth-generation) TPU has lower per-chip throughput than a Blackwell GPU. In factory-floor terms, Google's machine stamps fewer widgets an hour. It wins anyway, at the system level, for two reasons. First, the dies are cheap: Google buys them from Broadcom at low margin and wraps them in its own boards, racks, and optical fabric, so a TPU v6 rents for roughly $1.20–2.70 per chip-hour against roughly $2.12–6.50 per chip-hour for a B200. Second, and more importantly, Google doesn't run one machine; it wires hundreds or thousands into a single line. Ironwood, its 2025 inference TPU, addresses a 9,216-chip scale-up domain over an optical circuit switch; the production unit is the pod, not the chip. Slower individual machines, networked cheaply at enormous scale, beat faster machines you have to buy at Nvidia's margin. The per-token win is a property of the whole system, earned by extreme co-design against one workload, not a discount you get for opting out of Nvidia.
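The rental bands above imply a break-even on per-chip throughput, sketched below. The prices are the ones quoted in the text; pairing low end with low end and high end with high end is an assumption made for illustration, and the calculation deliberately ignores the pod-level networking effects that the argument says matter most.

```python
# Rental bands quoted in the text, in USD per chip-hour (low, high).
TPU_V6_BAND = (1.20, 2.70)
B200_BAND = (2.12, 6.50)


def breakeven_throughput_ratio(cheaper_price: float, faster_price: float) -> float:
    """Share of the faster chip's throughput the cheaper chip needs
    in order to tie on cost per token."""
    return cheaper_price / faster_price


# Assumption: compare like with like (low vs. low, high vs. high).
for label, tpu_price, gpu_price in zip(("low end", "high end"), TPU_V6_BAND, B200_BAND):
    ratio = breakeven_throughput_ratio(tpu_price, gpu_price)
    print(f"{label}: TPU v6 ties at {ratio:.0%} of B200 per-chip throughput")
```

Under these pairings a TPU v6 needs roughly 57% (low end) or 42% (high end) of a B200's per-chip throughput to match it on cost per token. That is the room a slower chip has to work with before any system-level advantage is counted.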

Take away either condition and the case weakens fast. Amazon's Trainium is built for batch throughput over latency, and early Trainium2 lagged the H100 on inference speed; its cheaper hardware has not yet converted into a clean per-token win. Microsoft's Maia is widely read as a generation or two behind. Meta's first-generation MTIA underwhelmed badly enough to be effectively skipped. The reason is the part the press release hides: designing the chip is the easy half. The hard half is the compiler, the kernels, and the years of software maturation that let a model actually run well on the silicon — the same stack that makes Nvidia's CUDA a moat. Anthropic is reportedly the only frontier lab that runs fungibly across Nvidia, Google, and Amazon silicon at once, and it took years of internal compiler and orchestration work to get there. That is the real cost of admission, and it is why a chip that looks 25% cheaper on a spec sheet can still lose to simply renting Nvidia.
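One way to see how a spec-sheet discount evaporates is to fold software maturity into the cost-per-token arithmetic. In the sketch below the 25% discount is the figure from the paragraph above; the "realized fraction" of peak throughput for each stack (0.60 and 0.40) is purely hypothetical, standing in for compiler and kernel maturity, and the prices and peak throughput are normalized to 1.

```python
def cost_per_token(price_per_hour: float, peak_tokens_per_hour: float,
                   realized_fraction: float) -> float:
    """Cost per token once software only delivers part of the peak throughput."""
    if not 0 < realized_fraction <= 1:
        raise ValueError("realized_fraction must be in (0, 1]")
    return price_per_hour / (peak_tokens_per_hour * realized_fraction)


SPEC_SHEET_DISCOUNT = 0.25   # from the text
INCUMBENT_REALIZED = 0.60    # hypothetical: mature software stack
CUSTOM_REALIZED = 0.40       # hypothetical: young compiler and kernels

# Normalized units: incumbent price 1.0, identical peak throughput of 1.0.
incumbent = cost_per_token(1.0, 1.0, INCUMBENT_REALIZED)
custom = cost_per_token(1.0 - SPEC_SHEET_DISCOUNT, 1.0, CUSTOM_REALIZED)

# Realized fraction the custom stack must reach just to tie.
breakeven = (1.0 - SPEC_SHEET_DISCOUNT) * INCUMBENT_REALIZED

print(f"incumbent: {incumbent:.3f}  custom: {custom:.3f}  (cost per token, normalized)")
print(f"custom stack must realize at least {breakeven:.0%} of peak to break even")
```

With these made-up fractions the discounted chip is the more expensive one per token, and it only breaks even once its software realizes 45% of peak. The specific numbers are not measurements; the point is that the discount and the software gap multiply.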

Even the scoreboard isn't settled. SemiAnalysis's InferenceMAX, the closest thing to an open benchmark of total cost per million tokens, currently ranks only Nvidia and AMD GPUs, and there Nvidia keeps cutting its own cost per token quickly through software. The custom ASICs aren't in it yet; TPU Ironwood and Trainium3 are slated to be added later. Until they are, "custom is cheaper per token" is settled for some Google workloads, unproven for Amazon's, and not yet judged on neutral ground.

The takeaway isn't that custom silicon fails. It's that the per-token win is a prize for a specific kind of player — one with a workload big and stable enough to justify obsessive system-level co-design — not a discount anyone collects for designing their own chip.

Where it still breaks

Two more caveats worth pricing in. The convergence bet carries a tail risk: if the architecture genuinely moves again — if something displaces the transformer the way the transformer displaced what came before — every fixed chip built around it inherits the old fatal problem at once. Custom silicon is, in part, a wager that the transformer is a destination, not a way-station. That wager looks good today. It is still a wager.

And even when custom wins, nobody escapes the real chokepoint. The press release says "designed in-house," but the chip is fabbed at TSMC, packaged with CoWoS, and paired with HBM from the same three memory suppliers (SK Hynix, Samsung, Micron) that Nvidia depends on. TSMC's advanced-packaging output is finite and runs a year or more behind demand. You can route around Nvidia's design margin. You cannot route around the physical bottleneck that sits upstream of both of you.

The reframe

So, the clean answer: yes, it's about margin, and yes, you're relocating dependency rather than escaping it — and no, a cheaper chip is not automatically a cheaper token. But the margins and the dependency were always there. What flipped custom silicon from a Google curiosity into an industry-wide program wasn't a financial discovery — it was an architectural one. The margins were the motive. The transformer holding still was the opportunity. And opportunity, not motive, is what explains timing.

Custom silicon didn't become smart. It became safe — for the players with a workload worth building around, and not a moment before.

Sources

InferenceMAX — SemiAnalysis — open inference benchmark normalized to total cost of ownership per million tokens; currently covers Nvidia and AMD GPUs, with TPU Ironwood and Trainium3 slated to be added.
NVIDIA Blackwell on InferenceMAX — vendor framing of Blackwell cost-per-token gains.
Amazon's AI Self-Sufficiency: Trainium2 Architecture & Networking — SemiAnalysis — Trainium2 positioning vs. Nvidia.
Per-chip rental bands (TPU v6 ~$1.20–2.70/chip-hr; B200 ~$2.12–6.50; GB200 ~$10.50–42) and scale-up-domain figures are drawn from the Peregrinations chips registry (2026 compute-economics survey).
Tri-platform fungibility (Anthropic across Nvidia/TPU/Trainium) per Krishna Rao, Invest Like the Best, 2026.
