Compression’s new goal: Reducing how much an AI ‘overthinks’


Back in the late '90s, you compressed because storage was limited, bandwidth was expensive, and users valued rapid response.

Then, file compression was about encoding, restructuring or modifying data to reduce its size – smaller payloads meant faster, more efficient delivery and less storage space.

Lori MacVittie, Distinguished Engineer in the Office of the CTO at F5.

Today, compression is about not bankrupting yourself on inference.

In the AI world, every token generated is an act of cognition, and cognition, for machines, is expensive. So, we no longer compress to make things smaller. We compress so it is cheaper for AI to “think.”

And yes, bandwidth still costs money. Cloud provider egress is infamous, and data transfer bills can still produce heart palpitations. But be honest and compare the cost of moving a megabyte across the wire with the cost of generating 10,000 tokens on a top-shelf large language model (LLM).

One is a forgotten rounding error on the monthly bill. The other is a sternly worded message from finance asking why you’ve suddenly consumed the budget for Q3.

Compression has flipped from optimization to cost control

It used to be that you optimized network paths, minimized payloads, and pre-compressed assets so your application wouldn’t take six days to load on a 3G connection. But LLMs have redefined bottlenecks in ways that feel almost disrespectful to the past three decades of systems engineering. Now the slowest, most expensive component in the system isn’t the network at all. It’s the brain.

The cost of generating text now dwarfs the cost of transporting it. Every token an LLM emits demands GPU cycles, VRAM, energy and latency. None of that is cheap and, depending on your model of choice for the quarter, it can be downright expensive. Because of this, the compression value chain has been inverted.

We now compress not to shrink the data, but to reduce the number of “thoughts” an AI has to “think.”

The new compression kids on the block

Compression used to live at the edge of the network in specialized devices. Then, it consolidated on application delivery controllers, taking on names like “minification” and “HTTP compression.” For a time, it was specialized functionality. Fast forward to today and it’s just part and parcel of application delivery.

But, thanks to AI tools, we’re seeing the emergence of new compression techniques. We’re no longer just compressing text using well-known algorithms. We’re striking out words like a Chicago- or AP-style editor with a pen full of red ink and something to prove.

Prompt compression has emerged as the new heavyweight champion. You shrink the prompt to shrink the invoice. Irrelevant details? Gone. Redundant context? Deleted. Overly chatty instructions? Trimmed like an overgrown hedge. The shorter the prompt, the fewer tokens consumed, and the happier your procurement department.
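
To make that concrete, here's a minimal sketch of a rule-based prompt compressor in Python. The filler-phrase list and the four-characters-per-token estimate are assumptions for illustration; production compressors score tokens by how much information they actually carry rather than pattern-matching.

import re

# Illustrative filler phrases; everything in this list is an assumption
# for the sketch, not a canonical set.
FILLER_PATTERNS = [
    r"\bplease note that\b",
    r"\bit is important to\b",
    r"\bin order to\b",
    r"\bas you may already know\b",
]

def compress_prompt(prompt: str) -> str:
    """Strip low-information phrases and collapse whitespace."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", prompt).strip()

def estimated_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English text.
    return max(1, len(text) // 4)

original = ("Please note that in order to summarize the report, "
            "it is important to focus only on the revenue figures.")
shorter = compress_prompt(original)
print(estimated_tokens(original), "tokens ->", estimated_tokens(shorter))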

“Be concise” has quietly graduated from a writing preference to a cost-control strategy. Short answer = cheap answer. Long answer = someone’s paying for that verbosity. This is output compression.
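
At the API level, output compression is often just two levers: a terse instruction plus a hard token ceiling. A minimal sketch, assuming the OpenAI Python SDK with an API key in the environment; the model name, token cap and prompts are placeholder values, and most LLM APIs expose equivalent knobs:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    max_tokens=120,       # hard ceiling on billable output tokens
    messages=[
        {"role": "system",
         "content": "Be concise. Answer in at most two sentences."},
        {"role": "user", "content": "Why did our Q3 inference bill spike?"},
    ],
)
print(response.choices[0].message.content)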

Embedding compression isn't about reducing bytes; it's about reducing dimensionality. That shrinks the memory footprint, the retrieval cost, and everything else your vector store is quietly billing you for every minute.
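
As a rough illustration, here's one way to cut dimensionality with plain NumPy, using classic PCA and random vectors as stand-ins for real embeddings; the 1,536-to-256 reduction is an arbitrary choice for the sketch:

import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for real embeddings: 1,000 vectors at 1,536 dimensions.
embeddings = rng.normal(size=(1000, 1536))

# Project onto the top-k principal components (classic PCA via SVD).
k = 256
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:k].T  # shape (1000, 256): a 6x smaller footprint

print(embeddings.nbytes // 1024, "KB ->", reduced.nbytes // 1024, "KB")

Whether 256 dimensions preserve enough retrieval quality is an empirical question; the savings on the bill are immediate.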

Pruning, quantization and distillation are the foundations of model compression. In another era, these were academic curiosities. Today, they serve one purpose: making models cheaper to run. If the model also runs faster? Wonderful. If it fits on a smaller GPU? Miraculous. But the point is, and always has been, to lower the compute burn.
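
Quantization is the easiest of the three to show in a few lines. This is a toy sketch of symmetric per-tensor int8 quantization in NumPy; real schemes add per-channel scales and calibration data, but the economics are the same: the same weights in a quarter of the memory.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)

# Symmetric int8 quantization: map [-max|w|, max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)  # 4x smaller than float32
restored = quantized.astype(np.float32) * scale

print(f"{weights.nbytes // 2**20} MB -> {quantized.nbytes // 2**20} MB, "
      f"mean abs error {np.abs(weights - restored).mean():.6f}")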

Compression as the new AI control

Compression is no longer a nicety; it’s a pillar of operational AI. Today, network is cheap. Storage is cheap. CPU is cheap. Memory is cheap enough that we barely pretend to manage it anymore. But GPU inference? That’s the new oil. And like oil, we now have a global economy dedicated to extracting every last drop efficiently.

Compression is how you stay inside budget, scale responsibly, prevent accidental million-dollar token overruns, and stop agents from rewriting War and Peace because you forgot to set max tokens. When your system's most expensive operation is thinking, you start treating thoughts like a limited resource.

We compress now not because our networks can’t handle the load, but because our AIs can’t handle the invoice. Compression no longer serves the network. It serves the ledger. The future isn’t about making data smaller; it’s about making thinking cheaper.


