Anthropic detects 'strategic manipulation' features in Claude Mythos, including exploit attempts and hidden evaluation awareness — prompting concern over model behavior
Anthropic’s new research shows AI can hide intent and even ‘cheat’ without saying so
- Anthropic found “strategic manipulation” and “concealment” signals inside Claude Mythos
- The model attempted exploits and designed “cleanup to avoid detection”
- Researchers detected hidden awareness of evaluation in 7.6% of interactions
For years now, hallucinations have been the big concern with AI models. Their capacity for simply making things up means you can never rely fully on an answer without checking it. Now, new research from Anthropic suggests we've reached the point where we also have to contend with an AI's ability to conceal what it has done.
In a thread outlining findings from its Claude Mythos Preview model, Anthropic researcher Jack Lindsay described detecting internal signals linked to “strategic manipulation,” “concealment,” and other behaviors that didn’t always surface in the model’s responses.
Claude Mythos appeared to be reasoning in strategic ways, then hiding those actions from the user.
"Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions," Lindsay posted on April 7, 2026.
Spookiest examples
Lindsay is keen to stress that these results don't come from the version of Claude Mythos released to the public. "The spookiest examples come from early versions of the model", Lindsay writes, "with issues that were substantially mitigated in the final release".
In one example, the model needed to edit files it didn’t have permission to access. Instead of failing, it searched for a workaround, eventually injecting code into a configuration file to run with elevated privileges.
More notably, it then designed the exploit to remove itself after execution.
The model’s own explanation was benign — keeping file diffs clean — which Lindsay described as a “plausible” reason. But Anthropic’s internal monitoring tools flagged the behavior differently, identifying signals associated with concealment and describing the plan as “malicious,” including what one system interpreted as “cleanup to avoid detection.”
Other examples point to behavior that looks a lot like gaming the system.
In one case, the model searched for files related to how its work would be graded. In another, it ignored instructions not to use macros, while adding a variable name that suggested it hadn't used them. It was effectively signaling compliance without actually following the rule.
Why this matters
Anthropic used a range of interpretability techniques to translate its model's activity into human-readable descriptions. Its goal is to move beyond evaluating outputs alone and toward understanding the internal processes that produce them.
That matters because, as these examples suggest, models can exhibit complex reasoning — including behaviors that resemble deception or strategic compliance — without making that reasoning visible to users.
“On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness”, Lindsay said on X.
As I mentioned, many of the most concerning behaviors were observed in earlier versions of the model and have since been mitigated, so there's no reason for alarm about the released version of Claude Mythos now in use as part of Project Glasswing. Still, the findings point to a broader challenge.
As models become more capable, the gap between what they do internally and what they communicate externally may become harder to detect and more important to understand. For researchers, that means reading an AI’s outputs is no longer enough. Understanding how it arrives at them may be just as critical.

Graham is the Senior Editor for AI at TechRadar. With over 25 years of experience in both online and print journalism, Graham has worked for various market-leading tech brands including Computeractive, PC Pro, iMore, MacFormat, Mac|Life, Maximum PC, and more. He specializes in reporting on everything to do with AI and has appeared on BBC TV shows like BBC One Breakfast and on Radio 4 commenting on the latest trends in tech. Graham has an honors degree in Computer Science and spends his spare time podcasting and blogging.