OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake

Humanity's Last Exam
(Image credit: Scale AI, CAIS)

  • The accuracy achieved by the top-scoring AI in the world's hardest benchmark has improved by 183% in just two weeks
  • ChatGPT o3-mini now scores up to 13% accuracy, depending on the reasoning setting used
  • OpenAI Deep Research obliterates competition with 26.6% accuracy result

The world's hardest AI exam, Humanity's Last Exam, was launched less than two weeks ago, and we've already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI's Deep Research topping the leaderboard.

The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man – it's so hard that when I previously wrote about Humanity's Last Exam, I couldn't even understand one of the questions, let alone answer it.

At the time of writing that last article, world phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated only on text (not multi-modal). Now, OpenAI's o3-mini, which launched earlier this week, has scored 10.5% accuracy at the o3-mini setting, and 13% accuracy at the o3-mini-high setting, which is more intelligent but takes longer to generate answers.

More impressive, however, is the benchmark score of OpenAI's new AI agent, Deep Research. The new tool scored 26.6%, a whopping 183% increase over DeepSeek R1's 9.4% in less than 10 days. It's worth noting that Deep Research can search the web and the other AI models can't, which makes the comparison slightly unfair. That ability is helpful for a test like Humanity's Last Exam, as it includes some general knowledge-based questions.
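For anyone who wants to check where the 183% figure comes from, here is a minimal Python sketch of the arithmetic. It assumes the increase is measured relative to DeepSeek R1's 9.4% text-only score, which is the previous leaderboard high quoted in this article; the function and variable names are purely illustrative.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# Accuracy scores (in %) as quoted in this article.
# Assumption: DeepSeek R1's text-only score is the baseline.
BASELINE = 9.4  # DeepSeek R1
scores = {
    "o3-mini": 10.5,
    "o3-mini-high": 13.0,
    "Deep Research": 26.6,
}

for model, score in scores.items():
    gain = percent_increase(BASELINE, score)
    print(f"{model}: {score}% accuracy, {gain:.0f}% higher than DeepSeek R1")
```

Note that this is a relative increase. In absolute terms, Deep Research's lead over DeepSeek R1 is 17.2 percentage points.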

That said, the accuracy of models taking Humanity's Last Exam is steadily improving, and it does make you wonder just how long we'll need to wait to see an AI model come close to completing the benchmark. Realistically, AI shouldn't be able to come close any time soon, but I wouldn't bet against it.

Better, but 26.6% never got me any SATs

OpenAI Deep Research is an incredibly impressive tool, and I've been blown away by the examples that OpenAI showed off when it announced the AI agent. Deep Research is able to work as your personal analyst, taking time to conduct intense research and come up with reports and answers that would otherwise take humans hours and hours to complete.

While a score of 26.6% on Humanity's Last Exam is seriously impressive, especially considering how far the benchmark's leaderboard has come in just a couple of weeks, it's still a low score in absolute terms – no one would claim to have passed a test with anything less than 50% in the real world.

Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they've come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do so?

John-Anthony Disotto
Senior Writer AI

John-Anthony Disotto is TechRadar's Senior Writer, AI, bringing you the latest news on, and comprehensive coverage of, tech's biggest buzzword. An expert on all things Apple, he was previously iMore's How To Editor, and has a monthly column in MacFormat. He's based in Edinburgh, Scotland, where he worked for Apple as a technician focused on iOS and iPhone repairs at the Genius Bar. John-Anthony has used the Apple ecosystem for over a decade, and is an award-winning journalist with years of experience in editorial.
