Alibaba unveils the network and datacenter design it uses for large language model training

Alibaba logo
(Image credit: Shutterstock / monticello)

Alibaba has revealed its datacenter design for LLM training, which apparently consists of an Ethernet-based network in which each host contains eight GPUs and nine NICs that each have two 200 GB/sec ports.

The tech giant, which also offers one of the best large language models (LLM) around via its Qwen model, trained on 110 billion parameters, says this design has been used in production for eight months, and aims to maximize the utilization of a GPU's PCIe capabilities increasing the send/receive capacity of the network. 

Another feature that increases speed is the use of NVlink for the intra-host network providing more bandwidth between hosts. Each port on the NICs is connected to a different top-of-rack switch avoiding a single point of failure a design that Alibaba call rail-optimized.

Each pod contains 15,000 GPUs

A new type of network is required because the traffic patterns in LLM training is different from general cloud computing because of low entropy and bursty traffic. there is also a higher sensitivity to faults and single point failures.

"Based on the unique characteristics of LLM training, we decided to build a new network architecture specifically for LLM training. We should meet the following goals; scalability, high performance, and single-ToR fault tolerance," the company said.

Another part of the infrastructure that was revealed was the cooling mechanism. As no vendors could provide a solution to keep chips below 105C, the temperature at which switches begin to shut down, Alibaba designed and created its own vapor chamber heat sink along with using more wicked pillars at the center of chips carrying heat away more efficiently.

The design for LLM training is encapsulated in pods that contain 15,000 GPUs and each pod can be located in a single datacenter. "All datacenter buildings in commission in Alibaba Cloud have an overall power constraint of 18MW, and an 18MW building can accommodate approximately 15K GPUs. In conjunction with HPN, each single building perfectly houses an entire Pod, making predominant links inside the same building." Alibaba wrote.

Alibaba also wrote it expects model parameters to continue to rise by an order of magnitude in the next several years from one trillion to 10 trillion parameters, and that its new architecture is planned to be able to support this and increase to a scale of 100,000 GPUs.

Via The Register

More from TechRadar Pro

James Capell
B2B Editor, Web Hosting

James is a tech journalist covering interconnectivity and digital infrastructure as the web hosting editor at TechRadar Pro. James stays up to date with the latest web and internet trends by attending data center summits, WordPress conferences, and mingling with software and web developers. At TechRadar Pro, James is responsible for ensuring web hosting pages are as relevant and as helpful to readers as possible and is also looking for the best deals and coupon codes for web hosting.

Read more
Nvidia H800 GPU
A look at the unbelievable Nvidia GPU that powers DeepSeek's AI global ambition
SambaNova runs DeepSeek
Nvidia rival claims DeepSeek world record as it delivers industry-first performance with 95% fewer chips
Cerebras WSE-3
DeepSeek on steroids: Cerebras embraces controversial Chinese ChatGPT rival and promises 57x faster inference speeds
DeepSeek
Nvidia out? DeepSeek pairs with banned Chinese tech giant to deliver unbelievably low pricing on AI inference which could cause Nvidia's house of cards to come crashing
Leaseweb boosts AI-focused services with the inclusion of Nvidia GPU solutions
A SCUBA Diver Checks An Undersea Cable
China 'sinks' 400 servers equivalent to 30,000 gaming PCs as it powers ahead with massive underwater data center project - but I wonder what GPU they use
Latest in Website Hosting
cybersecurity
What's the right type of web hosting for me?
A cloud symbol imposed over a bank of servers in a data center.
What is cloud hosting and who needs it?
Minecraft game server hosting for streamers heading - The Minecraft logo above a Minecraft landscape.
I tried 15 hosts for streaming and hosting Minecraft games and these are the best
Dark web scanning on a laptop
Hostinger integrates dark web scanning into hPanel
WordPress
WordPress Foundation bid for greater trademark control halted, adding to more legal setbacks for CEO Matt Mullenweg
The PebbleHost website.
PebbleHost review
Latest in News
L-mount alliance
Sirui joins L-Mount Alliance to deliver its superb budget lenses for Leica, DJI, Sigma and Panasonic cameras
Security padlock and circuit board to protect data
Trust in digital services around the world sees a massive drop as security worries continue
Samuel and Romy standing very close together in A24's Babygirl movie
Everything new on Max in April 2025, including A24's Babygirl and The Last of Us season 2
An AMD Radeon RX 9070 XT made by Sapphire on a table with its retail packaging
AMD’s secret weapon against Nvidia seems to be stock – way more RX 9070 GPUs are rumored to be hitting shelves than RTX 5000 models
Hacker silhouette working on a laptop with North Korean flag on the background
North Korea unveils new military unit targeting AI attacks
Seth Milchick and Kier Eagan's animatronic speaking in Severance season 2 episode 10
Apple TV+ announces Severance has been renewed for season 3 after that devastating finale