What is a Mixture of Experts model?
Mixture of Experts is an AI architecture designed to improve performance and reduce the processing costs of a model

Mixture of Experts (MoE) is an AI architecture that seeks to reduce the cost and improve the performance of AI models by sharing the internal processing workload across a number of smaller sub-models.
The concept first appeared in a 1991 paper co-authored by Geoffrey Hinton of the University of Toronto, one of the pioneers of AI. Strictly speaking, these smaller MoE components are not experts in any human sense; they are simply discrete neural networks which are each given sub-tasks to complete as part of the main task.
The technology essentially uses a form of routing to divide task processing into manageable chunks. This is done by structuring a large language model, during pre-training, as a set of smaller neural networks that work together, guided by a 'traffic cop' routing network which decides which of them should handle each piece of the input.
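To make the routing idea a little more concrete, here is a minimal sketch in Python. Every name and number in it (the eight experts, the top-two routing, the tiny hidden size) is purely illustrative rather than taken from any real model; it simply shows a router scoring one token against a handful of experts and keeping only the best two.

```python
# Toy 'traffic cop' router: score one token against a few experts, keep the top-k.
# All sizes and values here are illustrative assumptions, not from any real model.
import numpy as np

rng = np.random.default_rng(0)

num_experts = 8          # number of expert sub-networks
top_k = 2                # how many experts each token is routed to
hidden_dim = 16          # toy hidden size

token = rng.standard_normal(hidden_dim)                    # one token's hidden state
router_weights = rng.standard_normal((hidden_dim, num_experts))

logits = token @ router_weights                            # score each expert for this token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                       # softmax over experts

chosen = np.argsort(probs)[-top_k:]                        # keep only the top-k experts
print("route token to experts:", chosen, "with gate weights:", probs[chosen])
```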
The object of the architecture is to reduce the costs of processing AI tasks by sharing them between smaller components of a large model.
The concept has recently taken on greater significance thanks to the launch of the DeepSeek models, which deployed an innovative form of this technology to achieve impressive performance gains.
By employing a state-of-the-art MoE architecture, the DeepSeek team managed to launch a foundation model, trained with comparatively modest resources, which matched or outperformed some of the best models on the market at the time.
How does MoE work?
Rather than each 'expert' in this architecture having some form of human-style specialization, in reality each one is simply a sub-network of the main model.
A gatekeeper network, often called the router or gating network, is allocated the task of sharing the workload of the user request between the most suitable of these sub-networks.
As a result, the full large model never needs to process every part of the request, which means less computing power is needed to complete the task.
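Here is a hedged sketch of what that looks like inside a single MoE layer, again with purely illustrative sizes and with simple weight matrices standing in for the expert feed-forward blocks. The key point is that only the experts selected by the gate are ever evaluated, which is where the compute saving comes from.

```python
# Sketch of a sparse MoE layer forward pass. The sizes, and the use of plain weight
# matrices as 'experts', are assumptions for illustration only, not how any
# particular production model is implemented.
import numpy as np

rng = np.random.default_rng(1)
num_experts, top_k, hidden_dim = 8, 2, 16

# Each 'expert' here is just a small weight matrix standing in for a feed-forward block.
experts = [rng.standard_normal((hidden_dim, hidden_dim)) for _ in range(num_experts)]
router = rng.standard_normal((hidden_dim, num_experts))

def moe_forward(token):
    logits = token @ router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:]             # only these experts do any work
    gate = probs[top] / probs[top].sum()         # renormalise their gate weights
    # Mix the outputs of the chosen experts; the other experts are never evaluated,
    # which is where the compute saving comes from.
    return sum(w * (token @ experts[i]) for w, i in zip(gate, top))

out = moe_forward(rng.standard_normal(hidden_dim))
print(out.shape)   # (16,) — same output shape as a dense layer, at a fraction of the compute
```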
There are many different implementations of MoE architecture, from different researchers and organizations.
And while the overarching goal is to save time and money on the computation of AI tasks, the architecture also offers other advantages.
For example, if optimized properly, smaller MoE models can outperform larger models in the same tasks.
MoEs have other advantages. They are faster to pre-train than dense models of comparable quality, and they operate more efficiently at inference time, with little to no degradation in accuracy or performance.
In this way users can obtain the benefits of a large dense AI model, without the computational overhead and performance issues that typically come with the traditional architecture.
The downside is that, depending on how the experts are configured, the host computer needs more memory in order to keep every expert loaded and ready for instant use, and the training costs can be higher than with a single dense AI model.
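A rough back-of-the-envelope calculation shows the trade-off. The parameter counts below are the publicly reported figures for DeepSeek-V3 (around 671 billion total parameters, with roughly 37 billion active per token), and the two-bytes-per-parameter figure assumes 16-bit weights; treat this as illustrative arithmetic rather than a deployment guide.

```python
# Memory versus compute for a sparse MoE, using publicly reported DeepSeek-V3 counts.
# The bytes-per-parameter value is an assumption (fp16/bf16 weights).
total_params = 671e9       # every expert has to be held in memory
active_params = 37e9       # parameters actually exercised for any single token
bytes_per_param = 2        # assumption: 16-bit weights

weight_memory_gb = total_params * bytes_per_param / 1e9
compute_fraction = active_params / total_params

print(f"weights in memory: ~{weight_memory_gb:.0f} GB")                       # ~1342 GB
print(f"compute per token: ~{compute_fraction:.1%} of a dense model the same size")
```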
It's important to stress here that the whole field of MoE architecture is very much a work in progress at the moment.
In fact the technology has really only come into its own with mainstream AI applications in the past four years.
The commercial connection
Anthropic with its Claude models, Mistral AI with its Mixtral models, and DeepSeek have all pioneered important work in the MoE research field.
However, almost all of the major foundation model providers, such as OpenAI, Google and Meta, are using the architecture in one way or another to extract better performance from their models.
Interestingly enough, many of these major companies are exploring the technology using their own proprietary and open source frameworks.
Google's GShard, Tsinghua University's FastMoE and Microsoft's DeepSpeed-MoE are just a few of the important frameworks being developed at the moment.
Open source
But it's not all dominated by commercial money. Open source AI is also likely to be a huge beneficiary of the rise of MoE technology.
Typically, open source models are constrained by smaller budgets and computing resources, and therefore do not have the capacity to compete head-on with the huge resources of the major AI providers.
By deploying MoE tech, these smaller models can benefit from improved performance using modest platforms.
For example, the huge impact of the Chinese DeepSeek models was mainly due to the fact that they were developed and deployed on computing resources costing a fraction of those used in the West.
It seems clear that no matter how the technology develops over time, MoE has a key role to play in the development of AI as we move forward.
The only question is, will it be the West or China which comes out on top of this valuable tech?

Nigel Powell is an author, columnist, and consultant with over 30 years of experience in the tech industry. He produced the weekly Don't Panic technology column in the Sunday Times newspaper for 16 years and is the author of the Sunday Times book of Computer Answers, published by Harper Collins. He has been a technology pundit on Sky Television's Global Village program and a regular contributor to BBC Radio Five's Men's Hour. He's an expert in all things software, security, privacy, mobile, AI, and tech innovation.