What is AI quantization?


Quantization makes huge models smaller and more usable for general purposes


Quantization is a method of reducing the size of AI models so they can be run on more modest computers.

The challenge is to do this while retaining as much of the model's quality as possible, in other words, without introducing more response errors or hallucinations.

By shrinking the size with this technique, the models can be deployed on many more computing devices, such as home PCs, smartphones and even tiny appliances.


Major generative models like OpenAI's GPT, Google's Gemini, and Anthropic's Claude are massive data structures with billions or even trillions of parameters.

They need so much power because they have to cope with a wide range of general-purpose applications.

A big clue is the fact that the G in AGI refers to ‘general’ intelligence, because these foundation models have to cope with anything from school homework to advanced scientific calculus.

But there’s a cost to all this power, as you might expect.

These massive models need huge compute resources in order to run - it’s no exaggeration to say they can require data centers the size of a small village, and need energy systems to match.

Quantization is one of the key ways we can reduce those demands, and tailor models for more widespread needs.

How does it work?


Quantization basically reduces the precision of the numbers used in a neural network.

While this sounds like we’re deliberately making it inferior, it's actually an excellent compromise.

Base models typically store their parameters, the weights and biases, as 32-bit floating-point numbers (FP32).

By converting these numbers to less precise formats such as 16-bit, 8-bit or even 4-bit, quantization saves a huge amount of disk space and also cuts the memory and compute needed to run the model.
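To get a rough sense of the savings, the storage needed for the weights is roughly the parameter count multiplied by the bits per parameter. The sketch below assumes a hypothetical 7-billion-parameter model (a figure chosen purely for illustration) and ignores the small overheads, such as scale factors and metadata, that real quantized files carry. The function name `model_size_gb` is our own.

```python
def model_size_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical 7-billion-parameter model at the precisions named above.
NUM_PARAMS = 7_000_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(NUM_PARAMS, bits):.1f} GB")

# prints:
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```

On these assumptions, dropping from 32-bit to 4-bit cuts the weights to one eighth of their original size, which is the difference between needing server hardware and fitting on a typical home PC.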

Remember photo compression?

It’s a lot like compressing a high-resolution photo - the original RAW image might offer stunning detail, but the file size will likely be far too large to share or edit easily.

Using compression tools, we can dramatically reduce the file size and so make the image more practical to use. Ideally we use an image compression format like JPEG, which also minimizes the loss of detail and color quality, so most people won't notice a difference.

Quantized models similarly sacrifice a small amount of accuracy in exchange for dramatic improvements in utility, size and speed.
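The trade-off can be seen in a toy example. The sketch below uses one simple scheme, symmetric linear quantization, on a handful of made-up weight values; production tools use more elaborate methods, and the function names here are illustrative rather than taken from any library. It rounds each weight to a signed integer, converts it back, and measures how far the restored value drifts from the original.

```python
def quantize(weights, bits):
    """Symmetric linear quantization: map floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8-bit, 7 for 4-bit
    largest = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    scale = largest / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Turn the stored integers back into approximate floats."""
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights for illustration only.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.60]

for bits in (8, 4):
    quantized, scale = quantize(weights, bits)
    restored = dequantize(quantized, scale)
    error = mean_abs_error(weights, restored)
    print(f"{bits}-bit integers: {quantized}  mean error: {error:.4f}")
```

Running this shows the 8-bit version staying very close to the original values, while the 4-bit version drifts further: fewer bits mean a smaller model but a coarser approximation.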

It’s safe to say that without these substantial improvements, the world of AI models would be significantly more limited.

Big centralized AI models in huge data centers are great for flagship applications, but AI becomes so much more valuable when distributed to many systems across the globe.

And that’s before we talk about using AI on your smartphone, TV, or older, less powerful devices, all without having to connect to big cloud computers. This has enormous implications for accessibility in regions of the world with limited connectivity or computing resources.


Fun fact: While many people think of quantization as a new technology driven by the AI boom, its roots actually go back to signal processing and information theory from decades ago.

The digital music and photos we've enjoyed for decades rely on similar principles of reducing precision while retaining audio and visual quality.

Quantization techniques are becoming increasingly sophisticated as time passes, allowing even more dramatic compression with less impact on model performance.

One of the major beneficiaries of this improvement is the open source community.

Quantized versions of models like Llama, Mistral and DeepSeek are increasingly powering exciting new applications on personal computers, applications that would otherwise be prohibitively expensive to run through giant cloud AI services.

Nigel Powell
Tech Journalist

Nigel Powell is an author, columnist, and consultant with over 30 years of experience in the tech industry. He produced the weekly Don't Panic technology column in the Sunday Times newspaper for 16 years and is the author of the Sunday Times book of Computer Answers, published by Harper Collins. He has been a technology pundit on Sky Television's Global Village program and a regular contributor to BBC Radio Five's Men's Hour. He's an expert in all things software, security, privacy, mobile, AI, and tech innovation.
