Why does networking become the next bottleneck?

When models outgrow one accelerator, the training run becomes a communication problem wrapped around a math problem.

Where the binding constraint sits today

The network becomes the binding constraint when adding chips adds more synchronization time than useful compute. At that point, capability is limited by fabric bandwidth and topology, not just accelerator count.

Distributed training is synchronized work

Large training runs split model state, activations, optimizer updates, and data across many accelerators. Those accelerators must repeatedly exchange partial results before any of them can move on.

Every exchange costs time. Add enough chips and the network can eat the gain from adding more compute.
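
To make the trade-off concrete, here is a minimal sketch of a step-time model. It assumes a fixed amount of work per step divided evenly across chips, a ring all-reduce of the gradients estimated with the common latency-plus-bandwidth approximation, and no overlap between compute and communication. The function names and every number in it (compute time, gradient size, link speed, hop latency) are illustrative assumptions, not measurements.

```python
def ring_allreduce_seconds(n_chips, message_bytes, link_bytes_per_s, hop_latency_s):
    """Estimate a ring all-reduce: 2*(n-1) steps, each moving 1/n of the message."""
    if n_chips <= 1:
        return 0.0
    steps = 2 * (n_chips - 1)  # reduce-scatter pass plus all-gather pass
    chunk_bytes = message_bytes / n_chips
    return steps * (chunk_bytes / link_bytes_per_s + hop_latency_s)


def step_seconds(n_chips, single_chip_compute_s, message_bytes,
                 link_bytes_per_s, hop_latency_s):
    """Step time if compute divides perfectly and communication is not overlapped."""
    compute = single_chip_compute_s / n_chips
    comm = ring_allreduce_seconds(n_chips, message_bytes,
                                  link_bytes_per_s, hop_latency_s)
    return compute + comm


def scaling_efficiency(n_chips, **params):
    """Fraction of the ideal (communication-free) step time actually achieved."""
    ideal = params["single_chip_compute_s"] / n_chips
    return ideal / step_seconds(n_chips, **params)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of the curve.
    assumed = dict(
        single_chip_compute_s=100.0,  # time one chip would need for the whole step
        message_bytes=20e9,           # gradient bytes exchanged per step
        link_bytes_per_s=50e9,        # per-chip link bandwidth
        hop_latency_s=10e-6,          # per-step latency on the ring
    )
    for n in (8, 64, 512, 4096):
        print(n, round(scaling_efficiency(n, **assumed), 3))
```

In this model the compute term shrinks as 1/N while the bandwidth term stays close to twice the message size divided by link bandwidth, so efficiency falls as the chip count grows. Real systems overlap communication with compute and use other collective algorithms, which shifts the curve but does not remove the effect.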

Scale-up and scale-out are different regimes

Scale-up links connect chips within a tight domain. They are fast, expensive, and physically short. Scale-out networks connect racks and domains. They reach farther, but every crossing adds latency.

The boundary between the two is one of the most important architecture decisions in AI infrastructure.
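
One way to see why the boundary matters is a rough estimate of a two-level all-reduce: reduce-scatter and all-gather inside each scale-up domain, plus an all-reduce of each chip's shard across domains over the scale-out network. The sketch below uses a bandwidth-only ring estimate and ignores latency and contention. The function names and both bandwidth figures are hypothetical placeholders.

```python
def ring_term(n, message_bytes, bytes_per_s):
    """Bandwidth-only ring all-reduce estimate: 2*(n-1)/n * size / bandwidth."""
    if n <= 1:
        return 0.0
    return 2 * (n - 1) / n * message_bytes / bytes_per_s


def hierarchical_allreduce_seconds(total_chips, domain_size, message_bytes,
                                   scale_up_bytes_per_s, scale_out_bytes_per_s):
    """Two-level all-reduce estimate for chips grouped into scale-up domains."""
    if total_chips % domain_size != 0:
        raise ValueError("total_chips must be a multiple of domain_size")
    n_domains = total_chips // domain_size
    # Inside each domain: reduce-scatter, then later all-gather, on the fast links.
    inside = ring_term(domain_size, message_bytes, scale_up_bytes_per_s)
    # Across domains: each chip all-reduces only its 1/domain_size shard.
    across = ring_term(n_domains, message_bytes / domain_size,
                       scale_out_bytes_per_s)
    return inside + across


if __name__ == "__main__":
    # Hypothetical figures: 20 GB of gradients, fast scale-up, slower scale-out.
    for domain in (8, 64, 256):
        t = hierarchical_allreduce_seconds(
            total_chips=1024, domain_size=domain, message_bytes=20e9,
            scale_up_bytes_per_s=900e9, scale_out_bytes_per_s=50e9)
        print(domain, round(t, 4))
```

Under these assumptions, a larger scale-up domain shrinks the shard that has to cross the slower network, so total time drops even though the chip count is unchanged.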

Topology shapes what models are cheap to train

All-to-all fabrics, torus designs, fat trees, dragonfly networks, and optical circuit switches each reward different communication patterns.

This means model architecture and network topology have to be designed together. A mixture-of-experts model with heavy expert routing stresses the fabric differently from a dense model with predictable collectives.
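
As a rough illustration of how the same topology treats two traffic patterns differently, the sketch below counts shortest-path hops on a torus. Nearest-neighbor traffic stands in for ring-style collectives, and uniform all-to-all traffic stands in for expert routing. Both are simplifications, and the helper names and torus dimensions are assumptions made for the example.

```python
from itertools import product


def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus with wraparound."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def mean_hops(pairs, dims):
    """Average hop count over an iterable of (source, destination) pairs."""
    pairs = list(pairs)
    return sum(torus_hops(a, b, dims) for a, b in pairs) / len(pairs)


def neighbor_pairs(dims):
    """Each node sends to its +1 neighbor along the first axis."""
    for node in product(*(range(d) for d in dims)):
        yield node, ((node[0] + 1) % dims[0],) + node[1:]


def all_to_all_pairs(dims):
    """Every node sends to every other node."""
    nodes = list(product(*(range(d) for d in dims)))
    for a in nodes:
        for b in nodes:
            if a != b:
                yield a, b


if __name__ == "__main__":
    for dims in ((4, 4), (8, 8), (16, 16)):
        print(dims,
              mean_hops(neighbor_pairs(dims), dims),
              round(mean_hops(all_to_all_pairs(dims), dims), 2))
```

Neighbor traffic stays at one hop however large the torus gets, while the average all-to-all path lengthens with every added row, so the same fabric looks cheap to one pattern and expensive to the other.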

Optics matter because copper runs out of room

Inside a rack, copper can carry enormous bandwidth over short distances. As per-lane signaling rates rise, the distance a copper link can cover shrinks, so across rows and buildings optical links become unavoidable.

As frontier clusters grow, infrastructure design moves toward co-packaged optics, optical circuit switching, and fabric designs that minimize expensive global communication.

The strategic read is fewer slow crossings

A good cluster is not simply the one with the most accelerators. It is the one that keeps the critical communication path inside the fastest possible fabric for the longest possible time.

That is why networking becomes the next bottleneck after power and chips: it decides whether more silicon behaves like one machine or like a crowd waiting in line.