'A virtual DPU within a GPU': Could clever hardware hack be behind DeepSeek's groundbreaking AI efficiency?

A person's hand using DeepSeek on their mobile phone
(Image credit: Adobe Stock)

  • A new approach called DualPipe seems to be the key to DeepSeek's success
  • One expert describes it as an on-GPU virtual DPU that maximizes bandwidth efficiency
  • While DeepSeek has only used Nvidia GPUs, one wonders how AMD's Instinct accelerators would fare

China’s DeepSeek AI chatbot has stunned the tech industry, representing a credible alternative to OpenAI’s ChatGPT at a fraction of the cost.

A recent paper revealed DeepSeek V3 was trained on a cluster of 2,048 Nvidia H800 GPUs – crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). It reportedly required 2.79 million GPU-hours in total, covering pretraining on 14.8 trillion tokens and subsequent fine-tuning, and cost – according to calculations made by The Next Platform – a mere $5.58 million.

But exactly how DeepSeek's developers managed this feat is likely down to a clever hack.

A virtual DPU on the GPU itself

First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, features a total of 671 billion parameters, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational costs while maintaining high performance and accuracy – which you’ll see if you try it.
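
To get a feel for what selective activation looks like in practice, here is a toy sketch of top-k expert routing – the general Mixture-of-Experts pattern rather than DeepSeek's actual code, with deliberately tiny, illustrative sizes (eight experts and two active per token, instead of V3's hundreds):

```python
# Toy top-k Mixture-of-Experts routing (illustrative only, not DeepSeek's code).
# A gate scores every expert for each token, and only the k highest-scoring
# experts actually run – the rest of the layer's parameters stay idle.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities
        weights, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):              # naive per-token loop, for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

tokens = torch.randn(4, 64)
print(ToyMoELayer()(tokens).shape)              # torch.Size([4, 64])
```

Scale the same idea up to 671 billion parameters and you only ever pay for the 37 billion the gate picks for each token.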

It’s easy to be skeptical of DeepSeek and the claims made regarding its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.

According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reduces latency, and optimizes data movement across GPUs. By efficiently managing communication, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (Streaming Multiprocessors) between computation and communication, preventing data transfer bottlenecks as the model scales.
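
The broad principle – keep the GPU's compute units busy while data is in flight – is easy to sketch, even if DualPipe's real scheduling is far more involved. Below is a minimal, hedged illustration using two CUDA streams in PyTorch; the tensor sizes and the device-side copy standing in for all-to-all traffic are our own assumptions, and it needs a CUDA-capable GPU to run:

```python
# Minimal overlap sketch (a stand-in for the idea behind DualPipe, not its code):
# issue "communication" work on its own CUDA stream so computation on the
# default stream keeps running instead of waiting.
import torch

assert torch.cuda.is_available(), "this sketch needs a CUDA GPU"

comm_stream = torch.cuda.Stream()                    # stand-in for SMs reserved for comms

a = torch.randn(4096, 4096, device="cuda")
payload = torch.randn(64, 1 << 18, device="cuda")    # ~64MB of "tokens" to ship around
dst = torch.empty_like(payload)

with torch.cuda.stream(comm_stream):                 # "communication": a large async copy
    dst.copy_(payload, non_blocking=True)

for _ in range(8):                                   # computation carries on, default stream
    a = a @ a
    a = a / a.norm()

torch.cuda.synchronize()                             # both streams drained here
print("compute and (simulated) communication overlapped")
```

DualPipe takes the same idea much further, explicitly budgeting how many Streaming Multiprocessors handle communication so the rest never sit waiting on the network.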

A commenter on The Next Platform describes DualPipe as "essentially creating a virtual DPU on the GPU itself to handle all-to-all communication," which highlights its role in optimizing data transfer efficiency.

The paper goes into further detail: “In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink.”
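
DeepSeek's hand-tuned kernels aren't something we can reproduce here, but the generic shape of MoE "dispatching and combining" can be shown with PyTorch's stock collectives. The sketch below is just that generic pattern – the script name, world size, and tensor shapes are illustrative – and it would be launched with something like `torchrun --nproc_per_node=2 dispatch_sketch.py`:

```python
# Hedged sketch of MoE dispatch/combine using plain torch.distributed all-to-all.
# This is the textbook pattern, not DeepSeek's custom, topology-aware kernels.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

d_model = 16
# Token slices this rank wants to send to the experts hosted on each other rank.
send = [torch.full((4, d_model), float(rank), device="cuda") for _ in range(world)]
recv = [torch.empty(4, d_model, device="cuda") for _ in range(world)]

dist.all_to_all(recv, send)     # "dispatch": every rank swaps token slices with every other
# ...experts on this rank would now process `recv`...
dist.all_to_all(send, recv)     # "combine": results travel back to each token's home rank

dist.destroy_process_group()
```

In DeepSeek's cluster those exchanges ride NVLink inside a node and InfiniBand between nodes, which is exactly why the kernels are co-designed with the network topology.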

Example DualPipe scheduling

Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication. (Image credit: DeepSeek)
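
For a back-of-the-envelope sense of why those bubbles matter (our own illustration, not a figure from the paper): under a conventional one-forward-one-backward (1F1B) schedule, the idle fraction of each pipeline stage is roughly (p - 1) / (m + p - 1) for p stages and m micro-batches – overhead that DualPipe's bidirectional, overlapped schedule is designed to claw back:

```python
# Idle ("bubble") fraction of a conventional 1F1B pipeline schedule with
# p stages and m micro-batches. Illustrative comparison only.
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

# With the figure's setup of 8 pipeline-parallel ranks and 20 micro-batches:
print(f"{bubble_fraction(8, 20):.1%}")   # -> 25.9% of each stage's time idle
```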

Wayne Williams
Editor

Wayne Williams is a freelancer writing news for TechRadar Pro. He has been writing about computers, technology, and the web for 30 years. In that time he wrote for most of the UK’s PC magazines, and launched, edited and published a number of them too.

