Azure CTO Mark Russinovich’s annual Azure infrastructure sessions at Build are always fascinating as he explores the past, present, and future of the hardware that underpins the cloud. This year’s talk was no different, focusing on the same AI platform touted in the rest of the event.
Over the years it’s become clear that Azure’s hardware has grown increasingly complex. In the beginning, it was a prime example of utility computing, using a single standard server design. Now it offers many different server types, able to support all classes of workloads. GPUs came first, and now AI accelerators have been added.
That last innovation, introduced in 2023, shows how much Azure’s infrastructure has evolved along with the workloads it hosts. Russinovich’s first slide showed how quickly modern AI models have been growing, from 110 million parameters with GPT in 2018 to over a trillion in today’s GPT-4o. That growth has driven the development of massive distributed supercomputers to train these models, along with the hardware and software needed to make them efficient and reliable.
Building the AI supercomputer
The scale of the systems needed to run these AI platforms is enormous. Microsoft’s first big AI-training supercomputer was detailed in May 2020. It had 10,000 Nvidia V100 GPUs and clocked in at number five in the global supercomputer rankings. Only three years later, in November 2023, the latest iteration had 14,400 H100 GPUs and ranked third.
As of June 2024, Microsoft has more than 30 similar supercomputers in data centers around the world. Russinovich talked about the open source Llama-3-70B model, which takes 6.4 million GPU hours to train. On a single GPU that would take 730 years, but on one of Microsoft’s AI supercomputers, a training run takes roughly 27 days.
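Those figures check out with a little back-of-the-envelope arithmetic. Here’s a minimal sketch, assuming a supercomputer on the order of 10,000 GPUs (the exact cluster size wasn’t given in the talk):

```python
# Back-of-the-envelope check of the Llama-3-70B training figures.
gpu_hours = 6.4e6             # GPU hours quoted for training Llama-3-70B
hours_per_year = 365 * 24     # 8,760

print(gpu_hours / hours_per_year)     # ~730 years on a single GPU

# Assumption: an AI supercomputer on the order of 10,000 GPUs.
gpus = 10_000
print(gpu_hours / gpus / 24)          # ~27 days across the cluster
```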
Training is only half the problem. Once a model has been built, it needs to be used, and although inference doesn’t need the supercomputer levels of compute used for training, it still needs plenty of power. As Russinovich notes, a single floating-point parameter needs two bytes of memory, so a one-billion-parameter model needs 2GB of RAM, and a 175-billion-parameter model requires 350GB. That’s before you add in any necessary overhead, such as caches, which can add more than 40% to already-hefty memory requirements.
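The arithmetic behind those memory figures is simple to reproduce. A minimal sketch, assuming 16-bit parameters and treating the 40% cache overhead as a rough multiplier:

```python
# Estimate inference memory: two bytes per 16-bit parameter, plus overhead.
def inference_memory_gb(parameters: float, overhead: float = 0.40) -> float:
    bytes_per_param = 2                              # fp16/bf16 weights
    weights_gb = parameters * bytes_per_param / 1e9
    return weights_gb * (1 + overhead)

print(inference_memory_gb(1e9, overhead=0.0))    # 2.0 GB for a 1B-parameter model
print(inference_memory_gb(175e9, overhead=0.0))  # 350.0 GB for a 175B-parameter model
print(inference_memory_gb(175e9))                # ~490 GB once caches are included
```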
All of which means Azure needs a lot of GPUs with very specific characteristics to push a lot of data through as quickly as possible. Models like GPT-4 require significant amounts of high-bandwidth memory. Compute and memory both need substantial amounts of power. A single Nvidia H100 GPU draws 700 watts, and with thousands in operation at any time, Azure data centers have to remove a lot of heat.
Beyond training, design for inference
Microsoft has developed its own inference accelerator in the shape of its Maia hardware, which pioneers a new directed-liquid cooling approach. The Maia accelerators are sheathed in a closed-loop cooling system that required a whole new rack design, with a secondary cabinet containing the heat exchangers for the cooling equipment.
Designing data centers for training has shown Microsoft how to provision for inference. Training quickly ramps up to 100% power draw and holds there for the duration of a run. Using the same power monitoring on an inferencing rack, it’s possible to see how power draw varies at different points during an inferencing operation.
Azure’s Project POLCA aims to use this information to increase efficiency. It allows multiple inferencing operations to run at the same time by provisioning for peak power draw, giving around 20% headroom. That lets Microsoft put 30% more servers in a data center, throttling both server frequency and power when needed. The result is a more efficient and more sustainable approach to the compute, power, and thermal demands of an AI data center.
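Russinovich didn’t show the math behind those percentages, but the oversubscription idea is easy to sketch with hypothetical numbers: if inference servers rarely draw their rated peak at the same time, the same power budget accommodates more of them, with throttling as the safety valve.

```python
# Illustrative power-oversubscription sketch; not Microsoft's actual POLCA logic.
# All figures below are hypothetical.
rack_power_budget_w = 40_000   # facility power available to one rack
server_peak_w = 4_000          # nameplate peak draw of an inference server
server_typical_w = 2_600       # observed draw during normal inference

# Conservative provisioning: assume every server peaks at once.
conservative = rack_power_budget_w // server_peak_w                  # 10 servers

# Oversubscribed provisioning: plan around typical draw plus headroom,
# throttling frequency and power if the rack ever nears its budget.
headroom = 1.10
oversubscribed = int(rack_power_budget_w // (server_typical_w * headroom))  # 13 servers

print(conservative, oversubscribed)   # ~30% more servers in the same power envelope
```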
Managing the data for training models brings its own set of problems; there’s a lot of it, and it needs to be distributed across the nodes of those Azure supercomputers. Microsoft has been working on what it calls the Storage Accelerator to manage this data, distributing it across clusters with a cache that determines whether required data is available locally or needs to be fetched, using spare bandwidth to avoid interfering with current operations. Using parallel reads to load data allows large amounts of training data to be loaded almost twice as fast as traditional file loads.
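Microsoft hasn’t published the Storage Accelerator’s internals, but the check-local-cache-or-fetch pattern with parallel reads that Russinovich described looks roughly like this sketch (the paths and helper names here are assumptions for illustration):

```python
# Sketch of a local-cache-or-fetch pattern with parallel reads.
# Not the actual Storage Accelerator API; paths and helpers are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CACHE_DIR = Path("/local-nvme/training-cache")   # hypothetical local cache location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch_from_remote(name: str) -> bytes:
    """Placeholder for a remote-storage read using spare network bandwidth."""
    raise NotImplementedError

def read_shard(name: str) -> bytes:
    cached = CACHE_DIR / name
    if cached.exists():                  # data already staged on this node
        return cached.read_bytes()
    data = fetch_from_remote(name)       # otherwise pull it from remote storage
    cached.write_bytes(data)             # and keep a local copy for next time
    return data

def load_training_shards(names: list[str]) -> list[bytes]:
    # Parallel reads keep the GPUs fed instead of waiting on one file at a time.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(read_shard, names))
```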
AI needs high-bandwidth networks
Compute and storage are important, but networking remains critical, especially with massive data-parallel workloads running across many hundreds of GPUs. Here, Microsoft has invested significantly in high-bandwidth InfiniBand connections, with 1.2TBps of internal connectivity in each server linking its eight GPUs, and 400Gbps between individual GPUs in separate servers.
Microsoft has invested a lot in InfiniBand, both for its OpenAI training supercomputers and for the services it offers customers. Interestingly, Russinovich noted that “really, the only difference between the supercomputers we build for OpenAI and what we make available publicly is the size of the InfiniBand domain. In the case of OpenAI, the InfiniBand domain covers the whole supercomputer, which is tens of thousands of servers.” For other customers who don’t have the same training demands, the domains are smaller, but still at supercomputer scale, “1,000 to 2,000 servers in size, connecting 10,000 to 20,000 GPUs.”
All that networking infrastructure requires some surprisingly low-tech solutions, such as 3D-printed sleds to efficiently pull large bundles of cable. They’re placed in the cable trays above the server racks and pulled along. It’s a simple way to cut cabling time significantly, a necessity when you’re building 30 supercomputers every six months.
Making AI reliable: Project Forge and One Pool
Hardware is only part of the Azure supercomputer story. The software stack provides the underlying platform orchestration and support tools. This is where Project Forge comes in. You can think of it as an equivalent to something like Kubernetes, a way of scheduling operations across a distributed infrastructure while providing essential resource management and spreading loads across different types of AI compute.
The Project Forge scheduler treats all the available AI accelerators in Azure as a single pool of virtual GPU capacity, something Microsoft calls One Pool. Workloads have priority levels that control access to these virtual GPUs. A higher-priority load can evict a lower-priority one, moving it to a different class of accelerator or to another region altogether. The goal is to provide a consistent level of utilization across the whole Azure AI platform so Microsoft can better plan and manage its power and networking budget.
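Microsoft hasn’t detailed the Forge scheduler’s algorithm, but the priority-and-eviction behavior described in the talk can be sketched roughly as follows (the class and field names are assumptions):

```python
# Rough sketch of priority-based scheduling with eviction over a pooled
# set of virtual GPUs. Not Project Forge's actual implementation.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    priority: int        # higher value = higher priority
    gpus_needed: int

@dataclass
class VirtualGpuPool:
    capacity: int                        # total virtual GPUs in the pool
    running: list[Job] = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(j.gpus_needed for j in self.running)

    def submit(self, job: Job) -> list[Job]:
        """Admit the job, evicting lower-priority jobs if capacity is short.
        Evicted jobs would be requeued on another accelerator class or region."""
        evicted: list[Job] = []
        # Evict the lowest-priority work first until the new job fits.
        for victim in sorted(self.running, key=lambda j: j.priority):
            if self.free() >= job.gpus_needed:
                break
            if victim.priority < job.priority:
                self.running.remove(victim)
                evicted.append(victim)
        if self.free() >= job.gpus_needed:
            self.running.append(job)
        return evicted

pool = VirtualGpuPool(capacity=16)
pool.submit(Job("batch-finetune", priority=1, gpus_needed=16))
print(pool.submit(Job("prod-inference", priority=5, gpus_needed=8)))  # evicts the fine-tune
```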
Like Kubernetes, Project Forge is designed to help run a more resilient service, detecting failures, restarting jobs, and repairing the host platform. By automating these processes, Azure can avoid having to restart expensive and complex jobs, treating them instead as a set of batches that can run individually, with their inputs and outputs orchestrated as needed.
Consistency and security: ready for AI applications
Once an AI model has been built it needs to be used. Again, Azure needs a way of balancing utilization across different types of models and different prompts within those models. With no orchestration (or lazy orchestration), it’s easy to get into a position where one prompt ends up blocking other operations. By taking advantage of its virtual, fractional GPUs, Azure’s Project Flywheel can guarantee performance, interleaving operations from multiple prompts across virtual GPUs, allowing consistent operations on the host physical GPU while still delivering constant throughput.
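Flywheel’s internals weren’t covered, but the effect described, giving every prompt a fair slice of a physical GPU so no single one blocks the rest, resembles round-robin time-slicing across fractional virtual GPUs. A minimal sketch with hypothetical names:

```python
# Illustrative round-robin interleaving of prompt work on one physical GPU.
# Not Project Flywheel's actual mechanism; names and structure are assumptions.
from collections import deque
from dataclasses import dataclass

@dataclass
class PromptJob:
    prompt_id: str
    steps_remaining: int    # e.g., decode steps still to generate

def run_step_on_gpu(job: PromptJob) -> None:
    """Placeholder for one time slice of real GPU work (one decode step)."""
    job.steps_remaining -= 1

def interleave(jobs: list[PromptJob]) -> list[str]:
    """Give every active prompt one step per round, so a long prompt
    cannot block shorter ones and per-prompt throughput stays steady."""
    queue = deque(jobs)
    completion_order: list[str] = []
    while queue:
        job = queue.popleft()
        run_step_on_gpu(job)
        if job.steps_remaining > 0:
            queue.append(job)            # back of the line for its next slice
        else:
            completion_order.append(job.prompt_id)
    return completion_order

print(interleave([PromptJob("long", 6), PromptJob("short", 2), PromptJob("medium", 4)]))
# ['short', 'medium', 'long'] -- shorter prompts finish without waiting behind the long one
```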
Another low-level optimization is confidential computing for training custom models. You can run code and host data in trusted execution environments. Azure can now provide fully confidential VMs, including GPUs, with encrypted messages between the CPU and GPU trusted environments. You can use this for training or for securing the private data used in retrieval-augmented generation.
From Russinovich’s presentation, it’s clear that Microsoft is investing heavily in making its AI infrastructure efficient and responsive for both training and inference. The Azure infrastructure and platform teams have put a lot of work into building hardware and software that can support training the largest models, while providing a secure and reliable place to use AI in your applications.
Running OpenAI on Azure has given these teams plenty of experience, and it’s good to see that experience paying off by providing the same tools and techniques for the rest of us, even if we don’t need our own TOP500 supercomputers.
Copyright © 2024 IDG Communications, Inc.


