The names of boxing's heavyweights are never forgotten – and it's the same with the champs of the supercomputing world.
These machines truly are like no others. Each is computationally more muscular than its predecessor; and for a while, each has claimed the title of the fastest computer in the world.
But, as the calamitous fall of 'Iron' Mike Tyson showed us, champions are built to be felled. And so we've seen supercomputers come and go, growing from single-processor machines capable of a few thousand operations per second to systems like IBM's Roadrunner and the Cray XT Jaguar, the latter boasting a massive array of 45,000 AMD Opteron processors.
But exactly why do we need all this power? How much electricity does it take to power a supercomputer? What types of technology will tomorrow's supercomputers use? And how did it all begin? Round one begins now.
The rise of fast machines
For some time, the history of the fledgling supercomputer was the history of computing itself.
At the dawn of the digital age, devices like the Colossus Mark 1 and 2, and ENIAC filled entire rooms. They existed simply to crunch numbers far beyond human abilities. The term 'supercomputer' didn't enter common parlance until the 1960s, and it's often associated with just one famous individual – Seymour Cray. Cray's name is virtually synonymous with the supercomputer.
He started designing machines while working for Control Data Corporation (CDC), a company that had produced the fastest computers in the world for nearly a decade. Cray set himself the goal of creating a computer 50 times faster than the quickest system being sold by CDC at the time, the 48-bit 1604. It took him years, causing some consternation among CDC's management, but in 1964 the CDC 6600 came on the market.
Until the 1960s, computer processing power was measured by how many thousands of operations per second (OPS) a computer could perform. Colossus sported 5,000 OPS, ENIAC 100,000 OPS, and the fastest machine of the 1950s – IBM's catchily named AN/FSQ-7 – still only offered 400,000 OPS. By the time the CDC 6600 arrived, IBM had tripled the speed of its fastest system – the infamous 7030 Stretch – thanks to its adoption of transistors. But the CDC 6600 upped the ante. While Stretch could manage 1.2 MFLOPS (1,200,000 FLOPS), the CDC 6600 was 2.5 times faster, giving 3 MFLOPS. Note also that the yardstick had switched at the turn of the decade, from integer operations per second (OPS) to floating-point operations per second (FLOPS).
The leap in processing power given by the CDC 6600 has defined the concept of the supercomputer. Five years later, CDC made an even bigger step forward. The 7600 provided more than 10 times the performance of the 6600, giving 36 MFLOPS, and the trend continued, with the STAR-100 tripling the score in another five years to 100 MFLOPS. Within two years, Seymour Cray had broken away from CDC to form his own company. Its first product, the Cray-1, hit 250 MFLOPS in 1976.
Since then, supercomputer performance has increased by orders of magnitude every decade. The first GFLOP supercomputers (a thousand MFLOPS) arrived in the early 1980s, and the TFLOP level (a thousand GFLOPS) was exceeded by Intel's ASCI Red in 1997.
In 2008, IBM's Roadrunner became the first PFLOP supercomputer, achieving another thousand-fold increase in speed. Meanwhile, the fastest quad-core desktop processors now achieve over 50 GFLOPS – the same as the supercomputers of the early 1990s.
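The prefixes are easy to lose track of, so here is a rough sketch in Python that simply replays the figures quoted above. The dictionary and the `speedup` helper are our own illustrative names, not anything taken from a benchmark suite.

```python
# Unit prefixes, expressed in floating-point operations per second.
MFLOPS, GFLOPS, TFLOPS, PFLOPS = 1e6, 1e9, 1e12, 1e15

# Peak figures exactly as quoted in this article.
quoted_peaks = {
    "IBM 7030 Stretch": 1.2 * MFLOPS,
    "CDC 6600": 3 * MFLOPS,
    "CDC 7600": 36 * MFLOPS,
    "CDC STAR-100": 100 * MFLOPS,
    "Cray-1": 250 * MFLOPS,
}


def speedup(newer, older):
    """Return how many times faster the newer machine is than the older one."""
    return quoted_peaks[newer] / quoted_peaks[older]


print(speedup("CDC 6600", "IBM 7030 Stretch"))  # about 2.5
print(speedup("CDC 7600", "CDC 6600"))          # about 12
print(PFLOPS / quoted_peaks["CDC 6600"])        # a 1 PFLOPS machine versus the CDC 6600
```

The last line gives a feel for the gap: a petaflop machine is hundreds of millions of times faster than the system that started it all.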
What makes a computer super?
What made the first true supercomputers so much faster than the previous systems?
The answer really is quite simple: parallelism. The CDC 6600 was still what would be called a single-processor system, with just one central processor (CP). However, this was also assisted by a series of 10 slower peripheral processors (PPs), which ran in parallel.
The CP itself only handled mathematical and logic operations, while the PPs performed all of the memory and input/output tasks. Since the CP was handling a much smaller subset of operations, it could be run faster. The other important element was the switch from thermionic valves (vacuum tubes) to transistors, which offered faster switching speeds. These factors taken together meant that the CDC 6600's CP could run at 10MHz while other high-end computers of the day were operating at around 1MHz.
Since memory at that time was around 10 times faster than most supercomputer CPUs, the CDC 6600's architecture ensured that operations took full advantage of the bandwidth. The CDC 6600's PPs were each allowed access to the CP for one tenth of the time. So, although these were running slower than the CP, they were able to keep data flowing. The CDC 6600's CP also contained 10 function units internally, which enabled it to work on instructions in parallel. This was the first implementation of a superscalar processor design.
The idea of parallelism has continued to dominate the structure of supercomputers since the CDC 6600. It requires careful programming, mostly because the code has to be split up so that it can run in simultaneous chunks. The next Cray design introduced pipelining, a technique where an instruction unit is broken up into stages so that it can begin work on a new instruction before it has finished the last one.
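To see why pipelining pays off, consider a back-of-the-envelope model in Python. It assumes every stage takes exactly one clock cycle and that the pipeline never stalls, which real hardware cannot promise; the four-stage, 100-instruction figures are purely hypothetical.

```python
def cycles_unpipelined(instructions, stages):
    """Each instruction must pass through every stage before the next begins."""
    return instructions * stages


def cycles_pipelined(instructions, stages):
    """The first instruction fills the pipeline; after that one completes per cycle."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)


# Hypothetical workload: 100 instructions on a four-stage unit.
print(cycles_unpipelined(100, 4))  # 400
print(cycles_pipelined(100, 4))    # 103
```

Under these idealised assumptions, the pipelined unit approaches one finished instruction per cycle as the instruction stream gets longer.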
Superscalar designs with multistage pipelines are now de rigueur in modern desktop processors. VIA's C7 and Intel's Atom are notable non-superscalar exceptions.
Following the vector
The next CDC development introduced another important element that has defined the expansion of supercomputers ever since: vector processing. This technique sees a single operation being performed on multiple data sets at once.
The first system from Seymour Cray's own company – the Cray-1 – used vector processing with the addition of registers. These additions allowed it to apply multiple operations on the same data at once, and necessitated separate vector hardware – something that has been added to desktop CPUs in the form of secondary Single Instruction Multiple Data (SIMD) logic for the last decade.
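The difference between scalar and vector processing can be sketched in a few lines of Python. This is only a model of the programming idea: Python still does the additions one at a time, and the `lanes` parameter is a made-up stand-in for the width of a vector unit.

```python
def scalar_add(a, b):
    """Issue one add per pair of elements and count the operations issued."""
    result = []
    operations = 0
    for x, y in zip(a, b):
        result.append(x + y)
        operations += 1
    return result, operations


def vector_add(a, b, lanes=4):
    """Model a vector unit that adds `lanes` pairs of elements per operation."""
    result = []
    operations = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        operations += 1  # one vector instruction covers the whole chunk
    return result, operations


a = list(range(16))
b = [10] * 16
print(scalar_add(a, b)[1])  # 16 operations issued
print(vector_add(a, b)[1])  # 4 operations issued
```

Both functions produce the same sums; the vector version simply needs a quarter of the instructions to get there.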
Vector processing has remained the core structure of supercomputer CPUs. The only major additions have been multiprocessing and clustering, which are different levels of essentially the same thing. Multiprocessing groups multiple CPUs into a single computer (also known as a 'node'), while clustering groups together multiple nodes.
The multiprocessor computers can work on multiple streams of data using vector subsystems, so they are called Multiple Instruction Multiple Data systems. So while different supercomputer companies put a varying number and type of CPUs in each node and use a varying number of nodes in their clusters, the overall approach is almost universal.
The upshot of this is that the CPU design itself is no longer the focus of attention. Instead, manufacturers concentrate on how the CPUs are connected together. For example, non-uniform memory access (NUMA) has become a mainstay in supercomputing, particularly with processor designs that include on-die memory controllers.
In the first few decades of the supercomputer, memory was faster than processors, which was one of the main reasons behind the new design created for the CDC 6600. But nowadays CPUs are faster than memory, and this is even more of a problem if memory is shared across lots of processors.
NUMA alleviates this problem by giving each processor its own local memory. But rather than making this entirely discrete, processors can access each other's local memory. The memory and cache controllers associated with each processor must also communicate to maintain coherency. Otherwise, changes in data held locally would not be recognised when the same data is worked on by another processor. Fast connectivity between processors is therefore a necessity.
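A toy model shows why it matters where data lives in a NUMA machine. The latencies below are invented round numbers chosen for illustration, not measurements from any real system.

```python
def average_latency_ns(local_fraction, local_ns=100.0, remote_ns=300.0):
    """Average memory latency when a given fraction of accesses hit local memory.

    local_ns and remote_ns are hypothetical figures: local memory hangs off the
    processor's own controller, remote memory is reached via another processor.
    """
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


for fraction in (1.0, 0.9, 0.5):
    print(fraction, average_latency_ns(fraction))
```

The more of a program's data that sits in its own processor's memory, the closer the average gets to the local figure, which is why software written for NUMA machines tries to keep data near the processor that uses it.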
The need for speed
Now that we've covered all of the developments in supercomputing over the last five decades, it's probably time to mention why we even needed to build supercomputers in the first place. Put simply, we need them to perform calculations that are beyond our capabilities. The first computers were developed during World War II to execute complex code-breaking calculations that would have taken any human being an incredibly long time to perform.
This has remained the core function of supercomputers: performing complex and usually repetitive algorithms on huge data sets. Supercomputers have found a home in weather centres worldwide, and although their predictions might not always be as accurate as we might like, they do a far better job than we would be able to do without them!
They're also key in more general climate research: without supercomputers, we would probably not have known about global warming. NEC's Earth Simulator was created for precisely this purpose. The amount of data that needs to be processed when considering this global phenomenon is enormous.
Likewise, military problems also often require supercomputers. The current fastest supercomputer in the world – IBM's Roadrunner, installed at the Los Alamos National Laboratory in New Mexico – works for the US military. Most of its workload is classified, but it is known that much of it involves work on nuclear weapons.
Physics simulation in general is another important application. The ASCI Red was primarily created to provide the level of processing power required for 'full physics' numerical modelling, where all of the data and physical equations of a system can be used in full.
Other scientific applications include chemical and biological molecular analysis, with the latter the particular focus of the Folding@Home project, which turns everyday Internet-connected computers into a supercomputer distributed around the globe. Semiconductor design now also requires the use of supercomputers, so the systems are in effect designing their own future.
In order to benefit from supercomputing, problems must contain a considerable amount of parallelism. If a problem can't be split up in this way, it will be a waste to run it on this kind of exotic hardware.
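This limit is usually formalised as Amdahl's law, a standard result that we are bringing in here to illustrate the point. The sketch below assumes a program splits cleanly into a serial part and a perfectly parallel part, which is an idealisation.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Best-case speed-up when only part of a program can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# 45,000 processors, as in the Jaguar-sized array mentioned earlier.
print(amdahl_speedup(0.50, 45_000))  # just under 2
print(amdahl_speedup(0.99, 45_000))  # just under 100
```

Even with tens of thousands of processors, a program that is only half parallel can never run more than twice as fast, because the serial half still has to be done one step at a time.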
Fortunately, some tasks lend themselves naturally to parallelism. These tasks are nicknamed 'embarrassingly parallel', and examples include brute-force code cracking and graphics rendering, where each pixel can be calculated separately.
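Per-pixel rendering makes a neat demonstration. In the Python sketch below, the `shade` function and its formula are made up; what matters is that each pixel depends only on its own coordinates, so the work can be handed out in any order.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4


def shade(pixel):
    """Compute one pixel's brightness from its coordinates alone (invented formula)."""
    x, y = pixel
    return (x * 31 + y * 17) % 256


pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]

# One worker does everything...
serial_image = [shade(p) for p in pixels]

# ...or four workers share the pixels; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_image = list(pool.map(shade, pixels))

print(parallel_image == serial_image)
```

Threads are used here only to keep the example short. For genuinely heavy number crunching in Python, `ProcessPoolExecutor` from the same module is the more usual choice, and on a supercomputer the same split would be spread across nodes.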
Big computers have big problems, however. Thanks to the laws of physics, there is a limit on how fast data can travel: nothing can go faster than light.
For a spread-out system, data will take a fairly large amount of time to move from one processing subsystem to another, placing a ceiling over how fast calculations can occur. The continuing reduction in the size of transistors helps to pack more of them into the same space, so the distances between them will become smaller.
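A quick calculation puts numbers on that ceiling. The 3GHz clock and 30-metre machine room are assumptions picked for illustration, and real signals in copper or fibre travel noticeably slower than light in a vacuum, so these are best-case figures.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in a vacuum


def distance_per_cycle_m(clock_hz):
    """How far light travels during a single clock cycle."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz


def one_way_delay_cycles(distance_m, clock_hz):
    """Clock cycles that pass while a signal crosses the given distance."""
    return distance_m / distance_per_cycle_m(clock_hz)


print(distance_per_cycle_m(3e9))        # roughly 0.1 metres
print(one_way_delay_cycles(30.0, 3e9))  # roughly 300 cycles
```

In other words, a processor at one end of a large hall could tick hundreds of times before a message from the other end even arrives.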
But since supercomputers are now made up of clusters of multiprocessor computers, the communications paths between all the different elements have become the most significant bottleneck. Although the processors in today's supercomputers aren't far off what you find in a desktop, the networking fabric connecting them together remains highly specialised.
A key difference between AMD's Opteron processors, which are aimed at high-performance computing (HPC) usage, and its Athlon 64s (which are aimed at the desktop) is the number of HyperTransport buses available. These buses allow the processors in a node to collaborate more quickly. Intel's Quick Path Interconnect will perform a similar function when Core i7's higher-end Xeon siblings appear in early 2009.
Then there's the need to network nodes together as fast as possible to make the cluster. This requirement has led to the development of HPC-specific networking technologies. Sun used the Scalable Coherent Interface – which is capable of 20Gbps – for many of its supercomputers in the late 1990s, and saw its share of the TOP500 list grow rapidly as a consequence.
But the hunger for ever-increasing network bandwidth is never satiated, leading to the introduction of Infiniband, which can operate at speeds of up to 96Gbps – nearly a hundred times faster than the Gigabit Ethernet used for more general networking. The IBM Roadrunner uses Infiniband to connect its clusters.
A 100Gbps version of Ethernet called 100Gbase-X is also under development. Some supercomputer manufacturers have developed their own proprietary interconnect technology. NEC's IXS Super-Switch technology offers a staggering 256Gbps.
Another perennial problem of performance computing is that processing power also requires electrical power. This means that the more of the former you want, the more of the latter you're going to need.
IBM's valve-based AN/FSQ-7 of the 1950s required as much as 3MW – enough to illuminate a small town. The headline figures haven't diminished much over the years, either, with IBM's Roadrunner requiring 2.35MW at peak – although Roadrunner packs in thousands of processors while its predecessor powered just one.
Closely associated with this hunger for watts is one of its by-products: heat. Cray tackled this situation from the outset, using liquid cooling achieved with Freon and copper cold plates. The company also developed some other novel cooling systems, such as immersing components in electrically inert but highly heat-conductive fluids. This method was used to cool the Cray-2.
But the problem of cooling supercomputers extends far beyond their main internal components. With megawatts of electrical power going in, getting the heat away from the circuitry is just the beginning.
The cabinets must be designed with heat dissipation in mind, and the whole architecture of the supercomputer facility must transfer hot air to the outside atmosphere. This generally involves hefty amounts of air conditioning.
Some designs have involved elaborate water-cooling pipes worked through the facility itself, although this has fallen out of favour for cost reasons. Either way, taming the thermal problem is likely to consume a significant amount of electrical power. For example, IBM's ASCI White required 3MW to power its computing tasks, but it required an equal amount of power to cool the system while it was running.
Meeting the challenges
Most of the fastest computers in the world now use similar processors to those found in your desktop PC and even the latest consoles.
Cray's XT Jaguar came close to beating IBM's Roadrunner with a massive array of 45,000 quad-core AMD Opterons. But there is research into new designs that could again increase the power of individual processors by orders of magnitude. For example, CPUs are still resolutely two-dimensional.
Since getting data around the various components is a major issue for massively parallel systems, being able to pack transistors on top of each other as well as side by side promises the kind of leap in performance caused by the integrated circuit itself.
In October 2008, the Interuniversity Microelectronics Centre (IMEC) in Belgium announced a breakthrough in 3D stacking, demonstrating working circuits using its 5μm copper through-silicon vias (Cu-TSV) process. Two 130nm wafers were sandwiched on top of each other, with copper lands bonded together using thermocompression. So, in theory, two quad-core processors could be stacked to give an eight-core package in the footprint of a single quad-core chip.
In Japan, electronics firm Unisantis is working on a Stacked-Surrounding Gate Transistor (S-SGT) design, which promises to enable chips with clockspeeds between 20GHz and 50GHz. S-SGT is a bit like perpendicular recording in hard disks, with the transistors arranged vertically rather than horizontally.
This means that more transistors can be packed into the same space, and it reduces both the effects of some of the unwanted physical properties that are encountered when transistors reach a certain level of miniaturisation (such as gate leakage) and the speed limits caused by how far electrons have to travel from gate to gate.
Initial research is revolving around increasing the density of flash memory, which isn't surprising as Fujio Masuoka – who invented flash memory when he was working at Toshiba – is one of the chief proponents. But benefits are expected across all types of silicon products, including CPUs.
Processors hit a clockspeed wall a few years ago, which forced a switch to a parallel multicore approach to boost computing speed instead. But a tenfold increase in frequency would still provide a proportional boost in computing performance.
Optical computers have also been touted as a future replacement for current silicon-based designs. However, photonic transistors would actually require more power than electronic ones. So, in reality, optical computing is not likely to be the future of supercomputing. However, an area where optics do win out is when data rates and distances rise, as less loss of data is incurred compared to electrical lines.
Optical fibre is already the main enabling technology of high-speed telecommunications, and optical Infiniband cabling has been shown to exceed its copper equivalent in performance. Now, optical connections are also starting to be considered for use inside the CPU.
In particular, the Optical Shared Memory Supercomputer Interconnect System (OSMOSIS), a joint project of Corning Incorporated and IBM, aims to create a photonic-switching fabric. This would provide high-speed switching and scheduling of all the CPUs in a massive parallel cluster.
The most recent results demonstrated the fastest optical packet switch in the world, with an aggregate capacity of 2.5Tbps. Another promising possibility for the future of a supercomputer CPU comes from a much more organic source: DNA.
A demonstration in 2002 by researchers from the Weizmann Institute of Science in Rehovot, Israel, showed off an example of DNA computing that gave a performance of 330 trillion OPS. Even now – six years later – this performance would place it fourth in the TOP500 list, and astonishingly, this was achieved with a single DNA molecule. However, the technology is currently very limited in the kind of calculations it can perform, and it can only answer 'yes' or 'no' when asked a question.
The system isn't exactly a floating-point cruncher in the manner of traditional supercomputers, and it won't be making its way to a mainframe near you in the near future, but it could well come into its own at some point.
An even more esoteric answer to the problem of building a supercomputer comes from quantum physics. This is still a very new area, but small-scale calculations have been successfully demonstrated using the curious behaviour of matter at the quantum level, in particular entanglement and superposition. With entanglement, two or more objects have linked quantum states, meaning that when one changes, the other performs an identical transformation.
Superposition refers to the probabilistic way in which matter behaves at the quantum level. Taken together, these behaviours theoretically would allow quantum computers to perform calculations an order of magnitude quicker than traditional systems.
PFLOPS in your lounge
However, it's unlikely that any of these new CPU technologies will be making their way into supercomputing over the next few years. Developing a new and amazing processor design is great for the advancement of technology, but it must be realistic.
If the new design is 10 times faster, but a hundred times more expensive than designs derived from mainstream consumer products, then clusters of the latter will have a much more attractive price-performance proposition. This was the main reason why supercomputing hit a brick wall in the early 1990s, a period when many of the former big names were forced into bankruptcy.
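The sums behind that argument are simple enough to sketch. The figures are the hypothetical ones from the paragraph above, with speed and cost expressed relative to a single mainstream processor, and the model ignores the overheads of wiring a cluster together.

```python
def price_performance(relative_speed, relative_cost):
    """Performance bought per unit of money, relative to one mainstream chip."""
    return relative_speed / relative_cost


exotic = price_performance(relative_speed=10, relative_cost=100)
mainstream = price_performance(relative_speed=1, relative_cost=1)

# Spend the price of one exotic chip on mainstream chips instead.
budget = 100
cluster_peak = budget * mainstream  # 100 chips, ignoring interconnect overheads
exotic_peak = budget * exotic       # one chip that is 10 times faster

print(exotic, mainstream)
print(cluster_peak, exotic_peak)
```

For the same money, the cluster of ordinary chips offers ten times the peak performance on paper, provided the problem is parallel enough to use them all.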
Once upon a time, computer technology innovation flowed from the specialised high-end to the generalised consumer. But nowadays volume is a key requirement in order to provide the income necessary for the research and development of a new processor core design.
CPUs are designed with mass appeal first to make them financially viable, but with the ability for HPC derivatives of the processor to be made. For example, of the top 10 fastest supercomputers in the world as of November 2008, none used processors that were custom-designed for the purpose. Instead, AMD Opterons, Intel Xeons and IBM PowerPCs dominate, all of which have closely related consumer equivalents.
The benefits of consumer volume for supercomputing don't stop with CPUs. Since vector performance is so important to floating-point computation, the burgeoning speed of graphics cards also promises further massive leaps in supercomputing power, particularly when harnessed by distributed computing.
The latter is racking up some rather impressive processing scores. The Folding@Home project had reached a whopping 4.27 PFLOPS by 14 November 2008, making it the fastest supercomputer in the world by a country mile. Most tellingly, over half of this total was contributed by ATI and Nvidia GPUs. But it's also very significant that 1.7 PFLOPS of that total came from PlayStation 3 games consoles. In fact, IBM's Roadrunner, which is currently the fastest standalone supercomputer in the world, uses nearly 13,000 Cell processors that are closely related in design to the CPU in a PlayStation 3.
So the future of supercomputing could be sitting on your desk right now. As the Folding@Home project shows, distributed computing is already capable of achieving greater performance than the fastest standalone machines. Now that more than half the households in the developed world are online, the fabric of the Internet itself may be the future of the fastest computing on the planet.
Google certainly seems to think so. Its search engines are estimated to have over 300 TFLOPS at their disposal, and with the company getting into the application outsourcing business, maybe it won't be too long before anyone can have their very own slice of a supercomputer.
First published in PC Plus, Issue 278