How Netflix technology is improving data centre efficiency

What do recommendation engines for Amazon and Netflix have to do with better cloud computing? Thanks to a groundbreaking system from Stanford University, one is inspiring the other.

When faced with their usual workloads, most servers use only around 20 percent of their capacity. One reason is that cloud service users tend to overestimate the amount of compute they'll need.

Systems also experience slowdown as the workloads are passed between newer and older physical processing cores or hardware. Applications change, and a code rewrite might impose a bigger load on the server. Other applications that share the workspace might interfere with performance.

Christos Kozyrakis, associate professor of electrical engineering and computer science at Stanford's multi-scale architecture and systems team (MAST), says of the problem: "Everyone has the problem of underperforming servers."

Data centres today are the base stations for countless processes and computations going on all over the networked world at once. If we can successfully schedule application workloads in shared environments as much as possible, everybody benefits.

Whether it's Google, Amazon or your own infrastructure, those workloads are spread across different infrastructures, different locations and sometimes even different providers all at different times according to their needs.

"How do you decide how many resources they need and which resources you give to each application?" Kozyrakis says. "What we tried to do is figure out all this critical information that lets us do a good job of that."

The secret is that every computation you make on a server will run better under certain circumstances than others – newer cores, higher bandwidth back to base, low data burst, etc. – and the way to exploit that is to figure out the particular parameters under which your application will run best.

"You want to run it on every kind of machine you have, with every amount of interference possible and every scale factor to see what happens," Kozyrakis explains. "But to do that you'd have to run it a few thousand times, which is obviously stupid."

Instead, the Stanford system, called 'Quasar', samples a short glimpse of the program in action (often just a few milliseconds) and looks for similarities with other workloads it has already seen. When it has a few matches, the system directs the new application to the best possible infrastructure and scheduling, based on an informed guess about how it will perform.
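To make the idea concrete, here is a minimal sketch of that kind of matching in Python. It is not Quasar's actual algorithm: it swaps in a simple nearest-neighbour comparison, and the workload names, profile signatures, platform labels and performance scores are all invented for illustration.

```python
import math

# Hypothetical history: for each workload seen before, a short profiling
# signature and its measured performance on each platform (higher is better).
KNOWN_WORKLOADS = {
    "web-frontend": {"signature": [0.9, 0.2, 0.1],
                     "perf": {"high-cpu": 0.95, "high-memory": 0.60}},
    "batch-analytics": {"signature": [0.2, 0.9, 0.7],
                        "perf": {"high-cpu": 0.50, "high-memory": 0.90}},
    "cache": {"signature": [0.3, 0.8, 0.9],
              "perf": {"high-cpu": 0.40, "high-memory": 0.95}},
}


def cosine_similarity(a, b):
    """Similarity of two signatures: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_performance(signature, known, k=2):
    """Estimate per-platform performance from the k most similar workloads."""
    scored = sorted(
        ((cosine_similarity(signature, w["signature"]), w) for w in known.values()),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(score for score, _ in scored)
    if total <= 0:
        raise ValueError("no similar workloads to learn from")
    platforms = scored[0][1]["perf"].keys()
    # Similarity-weighted average of what the closest matches achieved.
    return {
        p: sum(score * w["perf"][p] for score, w in scored) / total
        for p in platforms
    }


def best_platform(signature, known, k=2):
    """Pick the platform with the highest predicted performance."""
    predicted = predict_performance(signature, known, k)
    return max(predicted, key=predicted.get)


if __name__ == "__main__":
    # A brief sample of a new, unseen application (made-up numbers).
    print(best_platform([0.25, 0.85, 0.8], KNOWN_WORKLOADS))
```

With these made-up numbers, the new application looks most like the analytics and cache workloads, so the sketch sends it to the platform those workloads ran best on.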

Best guess

If the above description makes the whole process sound a little hit and miss, think of the way heuristic antivirus works by scanning the code of incoming files. If something looks a little too much like something in the database that's already been identified as a cyber-nasty, it's flagged for checking.

Quasar does something similar, but instead of scanning the actual code of an incoming application, it fires it up for long enough to see how it will behave, then checks against a repository of knowledge to find matches.

More importantly (and quite simply), it works. "In the experiments we've done we've increased utilisation from 20 percent up to 60, 70 and in some cases 80," Kozyrakis says.

He hastens to add that raising utilisation on its own isn't difficult – the trick is whether you can do it while maintaining good application performance. "How does it perform with more cores or more memory? How well does an application run when you schedule it on the same machines as others? If you know this stuff you can do a good job of scheduling it."
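One way to picture that trade-off, as a rough sketch rather than anything Quasar is confirmed to do: given predicted performance at several allocation sizes, choose the smallest allocation that still meets the performance target. The function name, core counts and scores below are hypothetical.

```python
def smallest_allocation(predicted_perf, target):
    """Return the fewest cores whose predicted performance meets the target.

    predicted_perf maps a core count to a predicted performance score.
    Returns None when no allocation is predicted to be good enough.
    """
    feasible = [cores for cores, perf in predicted_perf.items() if perf >= target]
    return min(feasible) if feasible else None


if __name__ == "__main__":
    # Hypothetical predictions for one application at different sizes.
    predictions = {2: 0.55, 4: 0.82, 8: 0.93, 16: 0.95}
    print(smallest_allocation(predictions, target=0.90))
```

Giving the application 8 cores rather than 16 leaves the rest free for other work, which is where the utilisation gain would come from in this toy picture.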

At first glance it seems the market for such a system would be cloud service providers themselves – if only to assure themselves of the highest possible utilisation while maintaining the best performance for customers.

Kozyrakis isn't sure whether it would work independently as a private cluster management tool or a service layer on top of a product like Azure or AWS, but whatever form Quasar takes in the commercial world, the chance to triple the utilisation of your data centre is enough to make anyone sit up and take notice.