What Sudoku reveals about the limits of LLMs


We need to talk about LLM reasoning. For all the fanfare about performance gains, the most sophisticated AI models continue to fail at tests of basic reasoning.

In a study last year, Sapient Intelligence found that o3-mini-high, Claude 3.7, and DeepSeek R1 all scored exactly 0% on Sudoku-Extreme (a collection of hard Sudokus).

Zuzanna Stamirowska

CEO and co-founder of Pathway.

The fact that the most powerful AI systems struggle with a puzzle most of us can solve in a short train journey exposes a structural limit built into the LLMs that are anticipated to reshape the economy and society.

How We Got Here

We need to put this into context. The companies behind the world’s most widely used LLMs compete in many ways while still gathering around an architectural orthodoxy. Rather than replace the transformer architecture that launched the first LLMs, these companies doubled down, betting on ever-larger volumes of training data to make models smarter and building fragile workarounds.

Mechanisms have not yet been introduced to address the fact that LLMs treat every problem as a language problem, converting it into text and attempting to solve it by predicting the next token, one step at a time. Each word of a model’s output commits it to a direction. LLMs lack an internal reasoning space large enough to keep multiple competing possibilities open at once while solving a problem.

Which brings us to Sudoku. Sudoku is governed by rigid rules that are deceptively simple. Every digit from one to nine must appear exactly once in each row, column and three-by-three box. A completed grid is easy to check: the solution either holds, or it doesn’t. But solving it requires reasoning under constraints, not just describing them.
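To make the "easy to check" point concrete, here is a minimal Python sketch of a checker for a completed grid. The function name `is_valid_solution` and the list-of-lists grid format are illustrative assumptions, not anything taken from the study mentioned above.

```python
def is_valid_solution(grid):
    """Return True if grid is a complete, valid 9x9 Sudoku solution.

    grid is assumed to be a list of 9 rows, each a list of 9 integers.
    """
    # Reject anything that is not exactly 9x9.
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False

    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # Every row, column and box must contain each digit 1-9 exactly once.
    return all(unit == digits for unit in rows + cols + boxes)


# A valid grid built from a simple shifting pattern, used only as test data.
example = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_valid_solution(example))
```

Checking takes 27 set comparisons. Finding a grid that passes the check is the hard part.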

And that distinction is where transformer-based LLMs hit a wall, since they can’t hold multiple candidate paths in parallel. They can’t step back to reconsider a dead end without verbalizing every intermediate thought. Sudoku doesn't care how fluently you can describe the rules. It demands that those who take on the challenge search, backtrack and converge.

This problem is largely invisible in language tasks, which are common in everyday life and at which today’s LLMs excel. But Sudoku doesn’t live in language, and neither do most of the reasoning problems that LLMs need to be able to solve to break new ground.

Getting Over Workarounds

By now, we’ve all used LLMs enough to know this: they’re creative. Faced with a typical Sudoku, reasoning models with a clever enough prompt and access to code execution tools may write a Python script for a Sudoku solver and run the code. It works, but only because the rules are precise enough to be expressed as an algorithm.
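As an illustration of the kind of script involved, here is a minimal backtracking solver in Python. It is a sketch of one standard approach, not the code any particular model produces; the names `candidates` and `solve` and the string-based puzzle format are assumptions made for this example. The puzzle is a commonly reproduced example grid, with `0` marking an empty cell.

```python
def candidates(grid, r, c):
    """Digits that can legally go in the empty cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]


def solve(grid):
    """Fill grid in place by depth-first search; return True if solved."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d        # commit to a guess
                    if solve(grid):
                        return True
                grid[r][c] = 0            # dead end: undo and backtrack
                return False
    return True                           # no empty cells left


puzzle = [
    "530070000",
    "600195000",
    "098000060",
    "800060003",
    "400803001",
    "700020006",
    "060000280",
    "000419005",
    "000080079",
]
grid = [[int(ch) for ch in row] for row in puzzle]
solved = solve(grid)
for row in grid:
    print("".join(map(str, row)))
```

The search and the backtracking happen in the program's call stack, not in the model that wrote it.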

The model hasn’t reasoned through the puzzle; it has formalized the constraints as a program and handed the problem off. For problems where the rules are less rigid and depend on interpretation or shifting context, that escape route closes, and the model is out of options.

Fine-tuning tells a similar story. With enough bespoke training data, models can produce plausible solutions to particular problems. But test them on novel configurations and performance collapses. The model was acting on surface patterns, not native reasoning.

Taken together, these results punch a hole in a common narrative in AI today. We’re told that AI has evolved from niche models built for one purpose (like playing Go or an Atari game) to general models that perform across a dizzying range of problems. Sudoku is a relatively simple test of that promise.

The fact that today’s most advanced models can’t pass it without workarounds says something about the depth of that ‘general’ reasoning. It’s thin.

Why This Matters Beyond Sudoku

Sudoku is a useful test because the skills it demands are not unique to puzzles. Some of the most critical workflows in medicine, law, operations and planning are constraint problems in disguise. In medicine, doctors choose therapies that must balance efficacy, side effects, drug interactions and patient history simultaneously. In law, practitioners navigate shifting regulatory constraints, conflicting precedents and client context. In operations, teams trade off schedules, supply chains and resource allocation in dynamic conditions.

AI models wedded to reasoning through language alone can’t be meaningfully integrated into these workflows. That’s where the promise of AI integration in society bumps up against reality.

The path forward is not more parameters or longer chains of verbalized reasoning. It’s a leap forward to a better architecture: one that grants models a larger internal reasoning space, intrinsic memory that supports continual learning and the ability to work through non-language problems without forcing everything through text.

Think of a chess grandmaster playing twenty simultaneous games with his eyes closed, internalizing patterns and navigating each search space without verbalizing every step. That is what latent reasoning looks like, and what the transformer architecture can’t deliver. The work of AI neo-labs – including Pathway’s BDH (Dragon Hatchling) architecture – is showing that it can be done once the break from the transformer is made.

The Post-Transformer Moment

Post-transformer frontier models must keep what transformers are genuinely great at, language understanding and generation, while adding the ability to solve non-language problems that current LLMs can’t handle.

The real prize in doing so is creating AI capable of reasoning through constraints natively: the kind of capability that scheduling, compliance, planning and operations have always needed.

That’s the true step towards AGI that we need to strive for next.


This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today.

The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/pro/perspectives-how-to-submit
