Nvidia's Fermi graphics architecture explained

Next-gen GF100 silicon 10 times as powerful as existing Nvidia chip


Hardware comes and goes. Sometimes we get a little damp in anticipation, but usually it's nothing you want to run out into the street and proclaim.

There is an exception to this – high-power graphics cards, we love these. They make games sexy and that makes us sexy. At the heart of these is the GPU, and when Nvidia announces it has a new and wonderful one, it is time to take notice. The new architecture is codenamed Fermi, after the renowned nuclear physicist Enrico Fermi.

From being a humble bit-player (geddit?) the GPU has grown to be a crucial component. Next to the processor, this is where you want the power concentrated. There are all sorts of applications you could use a GPU for, but essentially on the home PC it is games that drive everything.

Offloading the reams of processor-intensive floating point calculations that 3D demands across to a chip dedicated to the task is the most cost-effective way to get things moving. Rather cheekily, Nvidia starts its whitepaper on Fermi by claiming to have invented the GPU in 1999.

The GeForce 256 was indeed the first to have transform and lighting in hardware, but come on guys, dedicated graphics chips date back to the 70s with Blitters, then 2D, and finally 3D accelerators (remember the buzz the Voodoo made?). Even if you define the GPU as only fully programmable 2D/3D acceleration chips, that's pushing it.

There are some big claims being made for Fermi. It is, apparently, the most advanced GPU ever made and the first GPU designed specifically for 'supercomputing', basically running those big and complicated jobs such as trying to simulate the gravitational interactions of an entire galaxy.

PRETTY CAR!: Why bother spending all that time modelling gravitational forces of the galaxy when you can have graphics this smooth

What it has created is essentially a storming maths co-processor which just happens to sit on a graphics card and run your graphics for you too. As with Nvidia's current range, it can run in two modes, compute and graphics mode, and it's the compute mode that has had Nvidia hanging out the flags.

The versions aimed at proper serious HPC applications don't even have a graphics output; the chip is used purely as a parallel computing engine. Fermi can switch between the two modes, horribly complicated maths and rasterizing, in a few clock cycles.

Apparently it is "the next engine of science". Tub-thumping aside, it does appear to be something rather special.

Three billion transistors

The silicon has been designed from the ground up to match the latest concepts in parallel computing. The basic features list reads thus: 512 CUDA Cores, Parallel DataCache, Nvidia GigaThread and ECC Support.

Clear? There are three billion transistors for starters, compared to 1.4 billion in a GT200 and a mere 681 million on a G80. There's shared, configurable L1 and L2 cache and support for up to 6GB of GDDR5 memory.

The block diagram of Fermi looks like the floor plan of a dystopian holiday camp. Sixteen rectangles, each with 32 smaller ones inside, all nice and regimented in neat rows. That's your 16 SM (Streaming Multiprocessor) blocks with 512 little execution units inside, called CUDA cores.

Each SM has local memory, register files, load/store units and a thread scheduler to run its 32 associated cores. Each of these cores can run a floating point or an integer instruction every clock. It can also run double precision floating point operations at half that rate, which will please the maths department.
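To put some numbers on that layout, here's a back-of-envelope sketch in Python. The SM count, the cores per SM and the half-rate double precision figure come from the description above. The clock speed is a made-up placeholder for illustration, not a figure from Nvidia, and the function names are ours.

```python
# Back-of-envelope throughput for the layout described above.
SM_COUNT = 16        # streaming multiprocessors on the chip
CORES_PER_SM = 32    # CUDA cores inside each SM


def total_cores(sm_count=SM_COUNT, cores_per_sm=CORES_PER_SM):
    """Total execution units across the whole chip."""
    return sm_count * cores_per_sm


def peak_instructions_per_second(clock_hz, double_precision=False):
    """One instruction per core per cycle; double precision runs at half that."""
    per_cycle = total_cores()
    if double_precision:
        per_cycle //= 2
    return per_cycle * clock_hz


# ASSUMPTION: a hypothetical 1GHz clock, chosen only to make the sums easy.
ASSUMED_CLOCK_HZ = 1_000_000_000

print(total_cores())
print(peak_instructions_per_second(ASSUMED_CLOCK_HZ))
print(peak_instructions_per_second(ASSUMED_CLOCK_HZ, double_precision=True))
```

These are theoretical peaks only. Real code rarely keeps every core busy on every cycle.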

PETROL HEADS REJOICE: The inside of your car in a future racing game? Nvidia thinks so

Initial trials have it pegged at four to five times faster than a GeForce GT200 running double precision apps. That's not quite fair perhaps, as this is Fermi's party trick, but still, gosh.

Nvidia's GigaThread engine, the global scheduler, intelligently ties together all these threads and pipes data around to use this wealth of processing power. We are in a world of out-of-order thread block execution and application context switching here.
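For a feel of what handing thread blocks out across SMs means, here's a toy model in Python. It is emphatically not Nvidia's algorithm: GigaThread is a piece of hardware and its actual policy isn't described here. The function name, the greedy "next free SM" rule and the block run times are all assumptions for illustration.

```python
import heapq


def schedule_blocks(block_costs, sm_count=16):
    """Greedy toy scheduler: each thread block goes to whichever SM frees up first.

    block_costs: run time of each thread block, in arbitrary units.
    Returns (assignment, finish_time), where assignment[i] is the SM that
    ran block i and finish_time is when the last block completes.
    """
    # Min-heap of (time the SM becomes free, SM index). Every SM starts idle.
    free_at = [(0, sm) for sm in range(sm_count)]
    heapq.heapify(free_at)

    assignment = []
    finish_time = 0
    for cost in block_costs:
        ready, sm = heapq.heappop(free_at)   # earliest available SM
        done = ready + cost
        assignment.append(sm)
        finish_time = max(finish_time, done)
        heapq.heappush(free_at, (done, sm))
    return assignment, finish_time


# One long block and five short ones on a cut-down, two-SM chip.
assignment, finish_time = schedule_blocks([5, 1, 1, 1, 1, 1], sm_count=2)
print(assignment, finish_time)
```

In this run the long first block hogs one SM while the five short ones file through the other, so blocks finish in a different order from the one they were issued in.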

Parallel DataCache provides configurable, unified L1 and L2 caches. Traditional load and read paths, which have to be flushed and managed carefully, have been replaced with shared memory for all tasks. It is also the first GPU with ECC (error-checking and correction).

The transistors are so teeny and carry such a small charge that they can easily be flipped by alpha particles from space (seriously), or more likely electromagnetic interference, creating a soft error. The error correction covers the register files, shared memory, caches and main memory.
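If you're wondering how a chip can notice and repair a flipped bit, here's a minimal sketch using the textbook Hamming(7,4) code in Python. This is a stand-in chosen for brevity: Nvidia's actual ECC scheme isn't described here, and real memory ECC generally protects much wider words. The function names are ours.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits [d1, d2, d3, d4] as a 7-bit codeword."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Repair up to one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4, 5, 6, 7
    error_pos = s1 + 2 * s2 + 4 * s3  # 0 means nothing looks wrong
    if error_pos:
        c[error_pos - 1] ^= 1         # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1                        # simulate a soft error in one bit
print(hamming74_decode(stored) == data)
```

Three extra parity bits are enough to pinpoint which of the seven stored bits went wrong, so a single flip gets put right before anyone reads the data.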

It's easy to get lost in all these technical terms. Essentially what we have is a chip that contains lots of little processors with a smart control system that enables it to work as one on a mass of data. It's flexible, scalable and perfect for streaming data, where parallel operations work.
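Here's roughly what "lots of little processors working as one" looks like from the programmer's side, mocked up in plain Python. The index sum (block number times block size, plus thread number) is the standard CUDA idiom. Everything else, including the function names and the serial loops standing in for real parallel hardware, is our own illustration.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """What a single 'thread' does: handle exactly one element, if it exists."""
    i = block_idx * block_dim + thread_idx   # this thread's global index
    if i < len(x):                           # guard the ragged final block
        out[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Serial stand-in for a parallel launch: run every thread of every block."""
    grid_dim = (n + block_dim - 1) // block_dim   # blocks needed, rounded up
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 4, 2.0, x, y, out)
print(out)
```

No thread depends on any other, which is why this sort of job spreads so easily across hundreds of cores. On a GPU the two loops in launch would not exist; the hardware runs the threads side by side.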

It's a fundamentally different approach to a CPU, which has to cope with serial tasks. Pound for pound the GPU offers, Nvidia claims anyway, ten times the performance for one twentieth the price.

CUDA

Hardware is half the whole of course. Nvidia CUDA (Compute Unified Device Architecture) is the C-based programming environment enabling you to tap into this multicore parallel processor goodness.

Nvidia has expanded the term to cover its whole GPU-based approach, hence naming the execution units CUDA cores. Language support includes C++, FORTRAN, Java, Matlab and Python. Yes, people still use FORTRAN. It supports double precision floating point, you see.

Support also includes OpenGL, DirectX, 32/64-bit Windows and Linux, and includes standard calls for such intensive tasks as Fast Fourier Transforms (such good fun).

Never mind all the physics stuff and programming, does this mean you can whack a Fermi card in a PC and expect it to run Direct3D games quickly then? Yes it does. Despite all the high-end apps jabber, this is still Nvidia's GPU and making graphics cards is its business.

You might want to know exactly how fast a Fermi-based card is going to be. Nvidia wouldn't be drawn into anything other than fairly vague ideas. That's good enough for us. Apparently, it'll be a blast.

In theory it is eight times faster than the best GeForce. In practice, what with other limiting factors, you'll see less than that, but it'll still destroy them.
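One way to see why the real-world figure comes in under the theoretical one is Amdahl's law, sketched below in Python. To be clear, this is our choice of model, not something Nvidia cited, and the fractions of work that benefit from the GPU are invented purely for illustration.

```python
def overall_speedup(accelerated_fraction, component_speedup):
    """Amdahl's law: only part of the total work gets the faster component."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / component_speedup)


# ASSUMPTION: illustrative fractions of frame time that the GPU speeds up.
for fraction in (1.0, 0.9, 0.5):
    print(fraction, round(overall_speedup(fraction, 8.0), 2))
```

Only when every last bit of the work speeds up do you get the full eight times. If even a tenth of the job is stuck waiting on something else, the overall gain drops below five.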

Get me one now

Hold your horses. The first cards are due next year; although exactly how you define "availability" is something of an issue. Next year is far more certain though. It'll be sold under the three Nvidia brands, GeForce for the consumer, Tesla for the lab boys and Quadro for the workstations.

Details on the first consumer card are currently very sparse. It'll be a high-end GeForce version to grab the headlines and at a price that's comparable with the current range. There's no news on the final spec though, or even if it'll sport the full 512 cores as demonstrated by Nvidia when it announced Fermi.

Quite possibly it'll have a reduced core version; the full 512 really is aimed at GPU computing after all. Fermi has been four years in the making and represents a horrific amount of work. It's destined to be at the heart of Nvidia's range and, so far, looks fantastic.

Get this: we asked if it could ray-trace a 3D world fast enough for gaming in real time on a consumer machine. Nvidia said yes it could…
