The dirty secret of high performance computing


In the decades since Seymour Cray developed what is widely considered the world’s first supercomputer, the CDC 6600, an arms race has been waged in the high performance computing (HPC) community. The objective: to enhance performance, by any means, at any cost.

Propelled by advances in the fields of compute, storage, networking and software, the performance of leading systems has increased one trillion-fold since the unveiling of the CDC 6600 in 1964, from the millions of floating point operations per second (megaFLOPS) to the quintillions (exaFLOPS).

The current holder of the crown, a colossal US-based supercomputer called Frontier, is capable of achieving 1.102 exaFLOPS by the High Performance Linpack (HPL) benchmark. But even more powerful machines are suspected to be in operation elsewhere, behind closed doors.

The arrival of so-called exascale supercomputers is expected to benefit practically all sectors - from science to cybersecurity, healthcare to finance - and set the stage for mighty new AI models that would otherwise have taken years to train.

The CDC 6600, widely considered the world's first supercomputer. (Image credit: Computer History Museum)

However, an increase in speed of this magnitude has come at a cost: energy consumption. At full throttle, Frontier consumes up to 40MW of power, roughly the same as a few hundred thousand desktop PCs drawing 100-200W apiece.

Supercomputing has always been about pushing the boundaries of the possible. But as the need to minimize emissions becomes ever more clear and energy prices continue to soar, the HPC industry will have to re-evaluate whether its original guiding principle is still worth following.

Performance vs. efficiency

One organization at the forefront of this issue is the University of Cambridge, which in partnership with Dell Technologies has developed multiple supercomputers with power efficiency at the heart of the design.

The Wilkes3, for example, ranks only 100th in the overall performance charts, but sits in third place in the Green500, a ranking of HPC systems by performance per watt of power consumed.
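To make the Green500 metric concrete, here is a minimal Python sketch of the performance-per-watt calculation. It plugs in the two Frontier figures quoted earlier in this article (1.102 exaFLOPS on HPL and a power draw of up to 40MW). The function name is purely illustrative, and because 40MW is a peak figure rather than the power measured during a benchmark run, the result should be read as a rough lower bound, not an official Green500 score.

```python
def gflops_per_watt(flops: float, watts: float) -> float:
    """Performance per watt, expressed in gigaFLOPS per watt."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return flops / watts / 1e9


# Figures quoted in this article; 40MW is an "up to" value (assumption:
# we treat it as the draw during the benchmark, which likely overstates it).
frontier_hpl_flops = 1.102e18   # 1.102 exaFLOPS
frontier_peak_watts = 40e6      # 40MW

efficiency = gflops_per_watt(frontier_hpl_flops, frontier_peak_watts)
print(f"{efficiency:.2f} GFLOPS per watt")  # about 27.55
```

A system ranked far lower on raw performance can still top this metric, which is how Wilkes3 ends up near the head of one chart and in the middle of the other.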

In conversation with TechRadar Pro, Dr. Paul Calleja, Director of Research Computing Services at the University of Cambridge, explained that the institution is far more concerned with building highly productive and efficient machines than extremely powerful ones.

“We’re not really interested in large systems, because they’re highly specific point solutions. But the technologies deployed inside them are much more widely applicable and will enable systems an order of magnitude slower to operate in a much more cost- and energy-efficient way,” says Dr. Calleja.

“In doing so, you democratize access to computing for many more people. We’re interested in using technologies designed for those big epoch systems to create much more sustainable supercomputers, for a wider audience.”

The Wilkes3 supercomputer might not be the world's fastest, but it's among the most power efficient. (Image credit: University of Cambridge)

Dr. Calleja also predicts an increasingly fierce push for power efficiency in the years to come, both in the HPC sector and in the wider datacenter community, where energy consumption is said to account for upwards of 90% of costs.

Recent fluctuations in the price of energy related to the war in Ukraine will also have made running supercomputers dramatically more expensive, particularly in the context of exascale computing, further illustrating the importance of performance per watt.

With Wilkes3, the university found a number of optimizations that helped to improve efficiency. For example, by lowering the clock speed at which some components were running, depending on the workload, the team was able to cut energy consumption by roughly 20-30%.

“Within a particular architectural family, clock speed has a linear relationship with performance, but a squared relationship with power consumption. That’s the killer,” explained Dr. Calleja.

“Reducing the clock speed reduces the power draw at a much faster rate than the performance, but also extends the time it takes to complete a job. So what we should be looking at isn’t power consumption during a run, but really energy consumed per job. There is a sweet spot.”
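The trade-off Dr. Calleja describes can be sketched in a few lines of Python. This is a toy model rather than anything measured on Wilkes3: it takes his stated relationships (runtime inversely proportional to clock speed, power proportional to clock speed squared) and adds one assumption of our own, a fixed "static" power draw that the hardware burns regardless of clock speed. Without that term the model would always favor the lowest possible clock, so the static draw is what produces a sweet spot here. All names and constants below are hypothetical.

```python
import math


def energy_per_job(freq_ghz: float,
                   work_gigacycles: float = 1000.0,
                   k_dynamic: float = 10.0,
                   p_static_watts: float = 40.0) -> float:
    """Energy in joules to complete one job at a given clock speed.

    Toy assumptions (not from the article's data):
      - runtime = work / frequency          (linear performance)
      - dynamic power = k_dynamic * f**2    (squared power relationship)
      - static power is constant            (our added assumption)
    """
    runtime_s = work_gigacycles / freq_ghz
    power_w = p_static_watts + k_dynamic * freq_ghz ** 2
    return power_w * runtime_s


def sweet_spot_ghz(k_dynamic: float = 10.0,
                   p_static_watts: float = 40.0) -> float:
    """Clock speed minimizing energy per job in this toy model.

    Energy = work * (p_static / f + k * f), which is smallest
    when f = sqrt(p_static / k).
    """
    return math.sqrt(p_static_watts / k_dynamic)


for f in (1.0, sweet_spot_ghz(), 3.0):
    print(f"{f:.1f} GHz -> {energy_per_job(f):,.0f} J per job")
```

In this model, running above the sweet spot wastes energy on the squared power term, while running below it wastes energy by keeping the machine powered on for longer. Real systems will have different curves for every workload, which is why the tuning has to be done case by case.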

Software is king

Beyond fine-tuning hardware configurations for specific workloads, there are also optimizations to be made elsewhere: in storage and networking, and in connected disciplines like cooling and rack design.

However, asked where specifically he would like to see resources allocated in the quest to improve power efficiency, Dr. Calleja explained that the focus should be on software, first and foremost.

“The hardware is not the problem, it’s about application efficiency. This is going to be the major bottleneck moving forward,” he said. “Today’s exascale systems are based on GPU architectures and the number of applications that can run efficiently at scale in GPU systems is small.”

“To really take advantage of today’s technology, we need to put a lot of focus into application development. The development lifecycle stretches over decades; software used today was developed 20-30 years ago and it’s difficult when you’ve got such long-lived code that needs to be rearchitected.”

The problem, though, is that the HPC industry has not made a habit of thinking software-first. Historically, much more attention has been paid to the hardware, because, in Dr. Calleja’s words, “it’s easy; you just buy a faster chip. You don’t have to think clever”.

“While we had Moore’s Law, with a doubling of processor performance every eighteen months, you didn’t have to do anything [on a software level] to increase performance. But those days are gone. Now if we want advancements, we have to go back and rearchitect the software.”

As Moore's Law begins to falter, advances in CPU architecture can no longer be relied upon as a source of performance improvements. (Image credit: Alexander_Safonov / Shutterstock)

Dr. Calleja reserved some praise for Intel, in this regard. As the server hardware space becomes more diverse from a vendor perspective (in most respects, a positive development), application compatibility has the potential to become a problem, but Intel is working on a solution.

“One differentiator I see for Intel is that it invests an awful lot [of both funds and time] into the oneAPI ecosystem, for developing code portability across silicon types. It’s these kind of toolchains we need, to enable tomorrow’s applications to take advantage of emerging silicon,” he notes.

Separately, Dr. Calleja called for a tighter focus on “scientific need”. Too often, things “go wrong in translation”, creating a misalignment between hardware and software architectures and the actual needs of the end user.

A more energetic approach to cross-industry collaboration, he says, would create a “virtuous circle” of users, service providers and vendors, which would translate into benefits from both a performance and efficiency perspective.

A zettascale future

In typical fashion, with the fall of the symbolic exascale milestone, attention will now turn to the next one: zettascale.

“Zettascale is just the next flag in the ground,” said Dr. Calleja, “a totem that highlights the technologies needed to reach the next milestone in computing advances, which today are unobtainable.”

“The world’s fastest systems are extremely expensive for what you get out of them, in terms of the scientific output. But they are important, because they demonstrate the art of the possible and they move the industry forwards.”

Whether systems capable of achieving one zettaFLOPS of performance, one thousand times more powerful than the current crop, can be developed in a way that aligns with sustainability objectives will depend on the industry’s capacity for invention.

There is not a binary relationship between performance and power efficiency, but a healthy dose of craft will be required in each subdiscipline to deliver the necessary performance increase within an appropriate power envelope.

In theory, there exists a golden ratio of performance to energy consumption, whereby the benefits to society brought about by HPC can be said to justify the carbon emitted.

The precise figure will remain elusive in practice, of course, but the pursuit of the idea is itself by definition a step in the right direction.

Joel Khalili is the News and Features Editor at TechRadar Pro, covering cybersecurity, data privacy, cloud, AI, blockchain, internet infrastructure, 5G, data storage and computing. He's responsible for curating our news content, as well as commissioning and producing features on the technologies that are transforming the way the world does business.