Sponsor Content Created With NVIDIA

NVIDIA RTX AI PCs pair with OpenAI’s open-weight gpt-oss-20b for a huge performance boost

NVIDIA hardware for running AI
(Image credit: NVIDIA)

The launch of OpenAI’s gpt-oss models, a family of open-weight models with advanced reasoning capabilities, offered plenty to get excited about. In particular, the consumer-grade gpt-oss-20b variant is lightweight enough to run on systems with just 16GB of memory.

With gpt-oss-20b, you not only get a powerful model for local inference, but also the means to adjust its behavior, inspect its reasoning and, for developers, integrate it into software and applications of your own. And since it runs locally, all of that comes without needing a constant connection to a remote server. If you want the model to run as fast as possible for a responsive experience, NVIDIA has shown it has the performance edge running gpt-oss-20b on its RTX AI PCs.

Just how fast gpt-oss-20b can go on local hardware has been shown with Llama.cpp, an open-source framework that can run LLMs (large language models) with great performance. That’s especially the case on RTX GPUs thanks to optimizations made in collaboration with NVIDIA. Llama.cpp offers a lot of flexibility to adjust quantization techniques, layer offloading and memory layouts.
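If you want a feel for how that works in practice, here is a minimal sketch using the llama-cpp-python bindings for Llama.cpp. The GGUF file name and local path are placeholders for whichever gpt-oss-20b build you download, and n_gpu_layers=-1 simply asks Llama.cpp to offload every layer to the RTX GPU.

# Minimal sketch: run gpt-oss-20b locally through the llama-cpp-python bindings.
# The model path below is a placeholder; point it at the GGUF build you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-20b.gguf",  # hypothetical local path
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window for the session
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what open-weight models are."}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])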

In Llama.cpp’s testing, NVIDIA’s flagship consumer-grade GPU, the GeForce RTX 5090 with 32GB of VRAM, was able to run gpt-oss-20b in Llama.cpp at an impressive 282 tok/s (tokens per second). Tokens are the chunks of text a model reads and writes, so the tok/s metric measures how quickly the model can process a prompt and generate a response. So how does the RTX 5090’s 282 tok/s stack up? Compared with the Mac M3 Ultra (116 tok/s) and AMD’s 7900 XTX (102 tok/s), it has a dramatic lead. The GeForce RTX 5090 includes built-in Tensor Cores designed to accelerate AI tasks, maximizing performance when running gpt-oss-20b locally.
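To put those numbers in perspective: at 282 tok/s, a roughly 500-token answer streams back in under two seconds, while at around 100 tok/s the same answer would take close to five.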

It’s not hard to tap into those AI capabilities either. For AI enthusiasts looking to run local LLMs with these NVIDIA optimizations, there’s the LM Studio application, which is built on top of Llama.cpp. LM Studio is designed to make running and experimenting with LLMs easy, without needing to wrestle with command-line tools or deep technical setup. It also supports RAG (retrieval-augmented generation), so you can quickly customize the knowledge base you’re working with.
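Developers can also talk to LM Studio from code. The sketch below assumes LM Studio’s local server is enabled on its default port (1234) and that a gpt-oss-20b download is loaded; the model identifier string is whatever name LM Studio lists for your copy.

# Minimal sketch: query a model loaded in LM Studio through its
# OpenAI-compatible local server (default address assumed below).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's local endpoint on the default port
    api_key="lm-studio",                  # placeholder; the local server doesn't check it
)

reply = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # use the model name LM Studio shows for your download
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
)
print(reply.choices[0].message.content)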

Ollama is another popular open-source framework that allows AI developers and enthusiasts to run and interact with LLMs. And just as with Llama.cpp, NVIDIA has worked with Ollama to optimize its performance, so you get boosted performance running gpt-oss models on that platform as well. AI enthusiasts can use the new Ollama app to simply chat with the LLMs, or a third-party app such as AnythingLLM, which offers a simple, local interface and also supports RAG.
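Ollama works much the same way from code. The sketch below uses the official ollama Python package and assumes the Ollama service is running and a gpt-oss-20b build has already been pulled; the exact model tag depends on what you downloaded.

# Minimal sketch: chat with a locally pulled gpt-oss model through Ollama.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",  # assumed tag; adjust to whatever `ollama list` shows on your system
    messages=[{"role": "user", "content": "What does running a model locally buy me?"}],
)
print(response["message"]["content"])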

Since gpt-oss-20b is available for users to deploy however they want, you can easily load it into your choice of NVIDIA-accelerated framework or application. LM Studio, AnythingLLM, and Ollama are all ready to fire up gpt-oss-20b and leverage NVIDIA RTX hardware for GPU-accelerated local inference.

To find out more about the many tools available for RTX AI PCs, you can follow NVIDIA’s RTX AI Garage blog.

Video: Multiply Your AI Performance with NVIDIA GeForce RTX AI PCs (YouTube)