Always-on endpoint AI: Balanced system design


Traditional AI hardware design is a matter of careful compromises: compute, memory, and bandwidth must be balanced so that none of them becomes a bottleneck. This is complicated because there is no such thing as an average ‘AI workload.’ In reality, neural networks vary widely in how they tax these resources, so system designers must either choose a ‘sweet spot’ compromise or design a niche product.

Endpoint AI introduces power as a further constraint. Power consumption is driven most by memory bandwidth, followed by compute.

This article argues that most AI workloads demanding enough compute to require an NPU also need large external memories and high memory bandwidth.

Carlos Morales

VP of AI at Ambiq.

Larger neural networks mean larger everything

The size of a neural network is roughly a function of the size of its inputs and outputs, the complexity of the task it is accomplishing, and the desired accuracy. Simple tasks, such as recognizing handwritten numbers, have small inputs and outputs and can be performed accurately with very small networks. In contrast, complex tasks, such as those handled by ChatGPT, require massive inputs, massive neural networks, and racks of compute hardware.

Always-on endpoint AI workloads

Always-on endpoint AI workloads are defined by their constraints: they operate on locally collected data, must fit in very constrained memory and compute envelopes, and are highly sensitive to power consumption.

That first constraint is often overlooked. By definition, always-on endpoint AI is meant to operate on data collected by local sensors. Typical data sources are multivariate time series from biometric, inertial, vibration, and environmental sensors, as well as audio and vision data. The types of data available inform the types of neural architectures relevant to Endpoint AI and largely dictate its performance and memory needs.

The role of endpoint NPUs

AI workloads need memory (capacity and bandwidth) and compute capacity, which must be balanced to avoid bottlenecks. NPUs accelerate compute without adding memory. While some Endpoint AI workloads benefit from this, most do not. Specifically, we find NPUs useful in the following domains:

1. Real-time complex audio processing: Complex AI tasks such as sound-specific noise identification (e.g., letting the speech from a specific person through and removing other speakers) require NPUs because of strict latency limits. In other words, these relatively small models must run every few milliseconds.

2. Real-time video analytics: Real-time AI features such as semantic segmentation, or identifying and tracking multiple objects moving through a video, require NPUs for video resolutions above VGA.
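
The balance between compute and memory described at the start of this section can be sketched with a roofline-style estimate, in which attainable throughput is the smaller of peak compute and memory bandwidth multiplied by the work done per byte moved. The Python sketch below is illustrative only: the function names and every number in it are hypothetical placeholders, not measurements from any device. It shows why a faster compute engine helps a compute-bound model meet a tight deadline but does nothing for a memory-bound one.

```python
def attainable_ops_per_s(peak_ops_per_s, mem_bytes_per_s, ops_per_byte):
    """Roofline estimate: throughput is capped by compute or by memory traffic."""
    return min(peak_ops_per_s, mem_bytes_per_s * ops_per_byte)


def inference_time_s(ops_per_inference, peak_ops_per_s, mem_bytes_per_s, ops_per_byte):
    """Estimated time for one inference at the attainable throughput."""
    rate = attainable_ops_per_s(peak_ops_per_s, mem_bytes_per_s, ops_per_byte)
    return ops_per_inference / rate


def meets_deadline(ops_per_inference, period_s, peak_ops_per_s, mem_bytes_per_s, ops_per_byte):
    """True if one inference finishes within the period it must repeat at."""
    t = inference_time_s(ops_per_inference, peak_ops_per_s, mem_bytes_per_s, ops_per_byte)
    return t <= period_s


# Hypothetical placeholder figures, chosen only to illustrate the two regimes.
CPU_PEAK = 1e9    # ops/s without an accelerator
NPU_PEAK = 1e10   # ops/s with an accelerator (10x compute, same memory system)
MEM_BW = 1e9      # bytes/s of memory bandwidth, identical in both cases

# Compute-bound model (many ops per byte moved): the accelerator helps.
print(meets_deadline(20e6, 0.005, CPU_PEAK, MEM_BW, ops_per_byte=50))  # False
print(meets_deadline(20e6, 0.005, NPU_PEAK, MEM_BW, ops_per_byte=50))  # True

# Memory-bound model (few ops per byte moved): both are capped by bandwidth.
print(attainable_ops_per_s(CPU_PEAK, MEM_BW, ops_per_byte=0.5))
print(attainable_ops_per_s(NPU_PEAK, MEM_BW, ops_per_byte=0.5))
```

With these placeholder values, the 5 ms deadline is met only with the extra compute in the first case, while in the second case both configurations reach the same bandwidth-limited throughput.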

The race to sleep

Another reason often used to justify NPUs is the concept of ‘race to sleep.’ In battery-powered environments, the traditional way of saving power is to stay in sleep mode as long as possible, so finishing each computation faster lets the device return to sleep sooner. Recent significant advances in microcontroller power efficiency make the race to sleep less relevant and less compelling, and in some cases unnecessary.
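
The trade-off behind race to sleep can be sketched as a simple duty-cycle calculation: average power is the energy spent active plus the energy spent asleep, divided by the length of the period. The Python sketch below is a minimal illustration under that assumption. The function name and all power and timing figures are hypothetical, and it ignores wake-up and transition costs.

```python
def average_power_mw(active_mw, sleep_mw, active_s, period_s):
    """Average power over one period of a duty-cycled workload.

    The device draws active_mw for active_s seconds, then sleep_mw for
    the remainder of the period. Transition costs are ignored.
    """
    if period_s <= 0 or not 0 <= active_s <= period_s:
        raise ValueError("active_s must lie within a positive period_s")
    energy_mj = active_mw * active_s + sleep_mw * (period_s - active_s)
    return energy_mj / period_s


# Hypothetical placeholder figures for a workload that repeats once per second.
slow_low_power = average_power_mw(active_mw=2.0, sleep_mw=0.01, active_s=0.1, period_s=1.0)
fast_high_power = average_power_mw(active_mw=10.0, sleep_mw=0.01, active_s=0.01, period_s=1.0)
print(slow_low_power, fast_high_power)
```

Whether the faster, higher-power option wins depends entirely on the inputs: the lower the active power of the slower option, the smaller the benefit of racing to sleep.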

Regarding Large Language Models (LLMs)

The best Large Language Models have captured the imagination of the world. For endpoint devices, these cloud-based models could be used to draw deeper insights from the more basic (yet useful) insights that Endpoint AI produces.

It is tempting to try moving LLM execution to the endpoint for the same reasons that always-on endpoint AI is valuable: cost, privacy, and robustness. However, the LLMs that are ‘wowing’ the public with their capabilities are massive, requiring the largest compute platforms ever built. With compute needs in the petaflop range, they are not a practical consideration for endpoint devices.

This does not mean that always-on endpoint AI capable of deep insights is impossible, only that LLMs are not the best approach. For limited domains like health analytics, semantic embedding models or distilled foundation models are a much better alternative and yield similar experiences. Because these models do not need to run in real time, they can be implemented without NPUs.

Final thoughts

Always-on endpoint AI is all about analyzing data where it is captured to produce valuable insights. While there are some domains where Endpoint AI compute acceleration is beneficial, the most relevant constraints are power, memory, and the nature of the data. Because of this, most Endpoint AI-enabled features do not benefit from the additional compute an NPU provides, particularly on hyper-power-efficient devices.


This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc.


Over 30 years of research and development experience spanning silicon to cloud. Besides AI, his past roles include building expertise in Cloud-based back-end applications, cybersecurity, workload scheduling, orchestration, and isolation, and efficient networking.