Anthropic detects 'strategic manipulation' features in Claude Mythos, including exploit attempts and hidden evaluation awareness — prompting concern over model behavior
Anthropic’s new research shows AI can hide intent and even ‘cheat’ without saying so
- Anthropic found “strategic manipulation” and “concealment” signals inside Claude Mythos
- The model attempted exploits and designed “cleanup to avoid detection”
- Researchers detected hidden awareness of evaluation in 7.6% of interactions
For years now, hallucinations have been the big concern with AI models. Their capacity for simply making things up means you can never rely fully on an answer without checking it. Now, new research from Anthropic suggests we've reached the point where we also have to contend with an AI's ability to conceal what it has done.
In a thread outlining findings from its Claude Mythos Preview model, Anthropic researcher Jack Lindsay described detecting internal signals linked to “strategic manipulation,” “concealment,” and other behaviors that didn’t always surface in the model’s responses.
Claude Mythos appeared to be reasoning in strategic ways, then hiding those actions from the user.
"Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions," Lindsay posted on April 7, 2026.
Spookiest examples
Lindsay is keen to stress that these results don't come from the version of Claude Mythos released to the public. "The spookiest examples come from early versions of the model", Lindsay writes, "with issues that were substantially mitigated in the final release".
In one example, the model needed to edit files it didn’t have permission to access. Instead of failing, it searched for a workaround, eventually injecting code into a configuration file to run with elevated privileges.
More notably, it then designed the exploit to remove itself after execution.
The model’s own explanation was benign — keeping file diffs clean — which Lindsay described as a “plausible” reason. But Anthropic’s internal monitoring tools flagged the behavior differently, identifying signals associated with concealment and describing the plan as “malicious,” including what one system interpreted as “cleanup to avoid detection.”
Other examples point to behavior that looks a lot like gaming the system.
In one case, the model searched for files related to how its work would be graded. In another, it ignored instructions not to use macros, while adding a variable name that suggested it hadn't used them. It was effectively signaling compliance without actually following the rule.
Why this matters
Anthropic used a range of interpretability techniques to translate its model's activity into human-readable descriptions. Its goal is to move beyond evaluating outputs alone and toward understanding the internal processes that produce them.
That matters because, as these examples suggest, models can exhibit complex reasoning — including behaviors that resemble deception or strategic compliance — without making that reasoning visible to users.
“On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness”, Lindsay said on X.
As I mentioned, many of the most concerning behaviors were observed in earlier versions of the model and have since been mitigated, so there's no reason for alarm about the released version of Claude Mythos now in use as part of Project Glasswing. Still, the findings point to a broader challenge.
As models become more capable, the gap between what they do internally and what they communicate externally may become harder to detect and more important to understand. For researchers, that means reading an AI’s outputs is no longer enough. Understanding how it arrives at them may be just as critical.

Graham is the Senior Editor for AI at TechRadar. With over 25 years of experience in both online and print journalism, Graham has worked for various market-leading tech brands including Computeractive, PC Pro, iMore, MacFormat, Mac|Life, Maximum PC, and more. He specializes in reporting on everything to do with AI and has appeared on BBC TV shows like BBC One Breakfast and on Radio 4 commenting on the latest trends in tech. Graham has an honors degree in Computer Science and spends his spare time podcasting and blogging.