OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake

(Image credit: Scale AI, CAIS)

The accuracy achieved by the top-scoring AI on the world's hardest benchmark has improved by 183% in just two weeks
ChatGPT o3-mini now scores up to 13% accuracy depending on capacity
OpenAI Deep Research obliterates competition with 26.6% accuracy result

The world's hardest AI exam, Humanity's Last Exam, launched less than two weeks ago, and we've already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI's Deep Research topping the leaderboard.

The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man – it's so hard that when I previously wrote about Humanity's Last Exam, I couldn't even understand one of the questions, let alone answer it.

At the time of writing that last article, world phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated only on text (not multi-modal). Now, OpenAI's o3-mini, which launched earlier this week, has scored 10.5% accuracy at the o3-mini setting, and 13% accuracy at the o3-mini-high setting, which is more intelligent but takes longer to generate answers.

More impressive, however, is the score achieved by OpenAI's new AI agent, Deep Research: 26.6%, a whopping 183% increase over DeepSeek R1's 9.4% in less than 10 days. It's worth noting that Deep Research can search the web and the other AI models can't, which makes the comparison slightly unfair. The ability to search is helpful for a test like Humanity's Last Exam, as it includes some general knowledge-based questions.
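For anyone who wants to check the maths, here's a minimal Python sketch of how a relative increase like that is worked out. It assumes the 183% figure is measured against DeepSeek R1's 9.4% text-only score, which the numbers above imply but don't spell out; the function name `relative_increase_pct` is made up for illustration.

```python
def relative_increase_pct(old: float, new: float) -> float:
    """Return the percentage change from `old` to `new`, relative to `old`."""
    if old == 0:
        raise ValueError("old score must be non-zero")
    return (new - old) / old * 100.0


# Accuracy scores quoted in this article, in percent.
# Assumption: DeepSeek R1's 9.4% is the baseline for the 183% figure.
baseline = 9.4
scores = {
    "o3-mini": 10.5,
    "o3-mini-high": 13.0,
    "Deep Research": 26.6,
}

for name, score in scores.items():
    gain = relative_increase_pct(baseline, score)
    print(f"{name}: {score}% accuracy, {gain:.0f}% above DeepSeek R1")
```

On those assumptions, the Deep Research line rounds to 183%, matching the headline figure. Bear in mind this is a relative increase: in absolute terms, the jump is 17.2 percentage points.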

That said, the accuracy of models taking Humanity's Last Exam is steadily improving, and it does make you wonder just how long we'll need to wait to see an AI model come close to completing the benchmark. Realistically, AI shouldn't be able to come close any time soon, but I wouldn't bet against it.

"It looks like the latest OpenAI model is very doing well across many topics. My guess is that Deep Research particularly helps with subjects including medicine, classics, and law." (pic.twitter.com/x8Ilmq1aQS, February 3, 2025)

Better, but 26.6% never got me any SATs

OpenAI Deep Research is an incredibly impressive tool, and I've been blown away by the examples that OpenAI showed off when it announced the AI agent. Deep Research is able to work as your personal analyst, taking time to conduct intense research and come up with reports and answers that would otherwise take humans hours and hours to complete.

While a score of 26.6% on Humanity's Last Exam is seriously impressive, especially considering how far the benchmark's leaderboard has come in just a couple of weeks, it's still a low score in absolute terms – no one would claim to have passed a test with anything less than 50% in the real world.

Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they've come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do so?
