Microsoft lifts the lid on plans for 'planet-scale' AI infrastructure

Microsoft has revealed it is working on a new “planet-scale” scheduling system for AI workloads, called Singularity.

As explained in a technical paper published by the firm, Singularity is a “novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance across a global fleet of AI accelerators”.

In plain terms, the system is designed to ensure that the company’s global network of server hardware is used as efficiently as possible, thereby cutting the cost of running AI workloads.

Microsoft Singularity

At the heart of the Singularity value proposition is the ability to resize jobs in mid-flow, as well as to shift them between different infrastructure located across the globe.

As explained in the paper, a live job can be migrated over to a different cluster or data center and resumed at the precise point at which it left off, thereby optimizing capacity usage. It can also be scaled elastically up or down, taking advantage of a varying number and type of AI accelerators as required.
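
To make the idea concrete, here is a minimal Python sketch of preempting a job, capturing its state and resuming it on a differently sized allocation. It is a toy illustration only, not Microsoft's implementation: the names ToyJob, Checkpoint and migrate are hypothetical, and the "training step" is a simple running sum.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Snapshot of a toy job's progress (hypothetical, for illustration)."""
    step: int
    total: float


class ToyJob:
    """Accumulates a running sum over a fixed number of steps."""

    def __init__(self, total_steps, workers, checkpoint=None):
        self.total_steps = total_steps
        self.workers = workers  # number of accelerators currently assigned
        self.step = checkpoint.step if checkpoint else 0
        self.total = checkpoint.total if checkpoint else 0.0

    def run(self, max_steps=None):
        """Run until finished, or until max_steps steps have been taken."""
        taken = 0
        while self.step < self.total_steps:
            if max_steps is not None and taken >= max_steps:
                break
            self.total += self.step  # stand-in for one training step
            self.step += 1
            taken += 1

    def preempt(self):
        """Capture the state needed to resume the job elsewhere."""
        return Checkpoint(step=self.step, total=self.total)


def migrate(job, new_workers):
    """Resume a job from its checkpoint on a different-sized allocation."""
    return ToyJob(job.total_steps, new_workers, checkpoint=job.preempt())


# A job that runs without interruption, for comparison.
reference = ToyJob(total_steps=10, workers=8)
reference.run()

# The same job, preempted after four steps and resumed on fewer workers.
job = ToyJob(total_steps=10, workers=8)
job.run(max_steps=4)
moved = migrate(job, new_workers=2)
moved.run()
```

In this sketch the migrated job finishes with the same result as the uninterrupted one, which is the property the paper describes as not impacting correctness. A real system would have to capture far more state, including accelerator memory, than this toy does.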

The beauty of this system, Microsoft says, is that it demands no additional work from developers, because jobs need no code modifications to run on Singularity.

To make all this possible, however, Microsoft had to find a way to decouple workloads from the hardware resources. The novel solution utilizes something the company is calling a “device proxy”, which runs in its own address space and establishes a layer of separation that allows for fluid reallocation of resources.
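
The general pattern here is a layer of indirection between a job and the hardware it runs on. The Python sketch below illustrates that pattern under heavy simplification: everything runs in a single process, whereas the article notes that Microsoft's device proxy runs in its own address space. The names FakeAccelerator, DeviceProxy and run_job are hypothetical and do not come from the paper.

```python
class FakeAccelerator:
    """Stand-in for a physical GPU or other accelerator (hypothetical)."""

    def __init__(self, name):
        self.name = name

    def multiply(self, a, b):
        return a * b


class DeviceProxy:
    """Forwards a job's calls to whichever accelerator is currently bound."""

    def __init__(self, device):
        self._device = device
        self.history = []  # name of the device that served each call

    def rebind(self, device):
        """Called by the scheduler; the job is unaware of the change."""
        self._device = device

    def multiply(self, a, b):
        self.history.append(self._device.name)
        return self._device.multiply(a, b)


def run_job(proxy, values):
    """Unmodified 'user code': it only ever talks to the proxy."""
    return [proxy.multiply(v, 2) for v in values]


proxy = DeviceProxy(FakeAccelerator("gpu-west-0"))
first = run_job(proxy, [1, 2])
proxy.rebind(FakeAccelerator("gpu-east-7"))  # scheduler moves the job
second = run_job(proxy, [3])
```

Because run_job holds a reference only to the proxy, the accelerator behind it can be swapped between calls without any change to the job's code.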

“Singularity achieves a significant breakthrough in scheduling deep learning workloads, converting niche features such as elasticity into mainstream, always-on features that the scheduler can rely on for implementing stringent SLAs,” wrote Microsoft, in its summation.

“With novel mechanisms that make unmodified jobs preemptible and resizable with negligible performance overhead, Singularity enables unprecedented levels of workload fungibility, making it possible for jobs to take advantage of the spare capacity anywhere in the globally-distributed fleet.”

Although the paper focuses predominantly on the scheduling service, the authors also state that the system is designed to scale across a fleet of hundreds of thousands of GPUs and other AI accelerators.

TechRadar Pro has asked Microsoft when it expects Singularity to become commercially available.

Joel Khalili
News and Features Editor
