Alexa now runs on more powerful cloud instances, opening the door for complex new features

Amazon Echo Dot
(Image credit: Amazon)

Amazon's cloud-based voice service Alexa is about to get a whole lot more powerful, as the Amazon Alexa team has migrated the vast majority of its GPU-based machine inference workloads to Amazon EC2 Inf1 instances.

These new instances are powered by AWS Inferentia chips, and the upgrade has resulted in 25 percent lower end-to-end latency and 30 percent lower cost compared with GPU-based instances for Alexa's text-to-speech workloads.
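To put those percentages in concrete terms, here is a small sketch using hypothetical baseline numbers (the percentages come from AWS; the baseline latency and cost figures are made up for illustration):

```python
# Hypothetical baseline figures for a TTS request on a GPU-based instance.
# The 25% / 30% reductions are the figures AWS reported; the baselines
# below are invented purely for illustration.
gpu_latency_ms = 100.0   # assumed end-to-end latency per request
gpu_cost_units = 1.00    # assumed relative compute cost per request

# 25 percent lower latency and 30 percent lower cost on Inf1
inf1_latency_ms = gpu_latency_ms * (1 - 0.25)
inf1_cost_units = gpu_cost_units * (1 - 0.30)

print(inf1_latency_ms)  # 75.0
print(inf1_cost_units)  # 0.7
```

At the scale of billions of inference requests per week, even single-digit percentage savings compound quickly, which helps explain the urgency of the migration.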

In a press release, AWS technical evangelist Sébastien Stormacq explained why the Amazon Alexa team decided to move away from GPU-based machine inference workloads, saying:

“Alexa is one of the most popular hyperscale machine learning services in the world, with billions of inference requests every week. Of Alexa’s three main inference workloads (ASR, NLU, and TTS), TTS workloads initially ran on GPU-based instances. But the Alexa team decided to move to the Inf1 instances as fast as possible to improve the customer experience and reduce the service compute cost.”

AWS Inferentia

AWS Inferentia is a custom chip built by AWS to accelerate machine learning inference workloads while also optimizing their cost.

Each chip contains four NeuronCores, and each core implements a high-performance systolic-array matrix-multiply engine that massively speeds up deep learning operations such as convolutions and transformers. NeuronCores also come equipped with a large on-chip cache that cuts down on external memory accesses, dramatically reducing latency while increasing throughput.
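The general idea behind a systolic-array engine with local accumulation can be sketched in a few lines of Python. This is a toy tiled matrix multiply, not Inferentia's actual microarchitecture: the point is that partial sums are accumulated locally per tile, so data is reused many times per fetch instead of being re-read from slow external memory for every multiply-add.

```python
# Toy sketch of tiled matrix multiplication -- the general idea behind
# systolic-array engines and on-chip caching, NOT Inferentia's real design.

def matmul_tiled(A, B, tile=2):
    """Multiply n x n matrices A and B in tile x tile blocks.

    Each tile of A and B is conceptually "loaded" once and reused for
    a whole block of partial sums, rather than fetching individual
    elements from (slow) external memory for every multiply-add.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # accumulate this tile's partial products locally
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

In hardware, the same reuse happens spatially: operands flow through a grid of multiply-accumulate units, and the large on-chip cache keeps tiles resident so external memory traffic stays low.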

For users wishing to take advantage of AWS Inferentia, the custom chip can be used natively from popular machine learning frameworks, including TensorFlow, PyTorch and MXNet, via the AWS Neuron software development kit.

In addition to the Alexa team, Amazon Rekognition is also adopting the new chip: running models such as object classification on Inf1 instances resulted in eight times lower latency and double the throughput compared with running the same models on GPU instances.

Anthony Spadafora

After working with the TechRadar Pro team for the last several years, Anthony is now the security and networking editor at Tom’s Guide where he covers everything from data breaches and ransomware gangs to the best way to cover your whole home or business with Wi-Fi. When not writing, you can find him tinkering with PCs and game consoles, managing cables and upgrading his smart home.