Microsoft researchers crack AI guardrails with a single prompt
AI guardrails aren't entirely bulletproof after all
- Researchers were able to reward LLMs for harmful output via a 'judge' model
- Multiple iterations can further erode built-in safety guardrails
- They see this as a lifecycle problem, not an inherent model problem
Microsoft researchers have revealed that the safety guardrails used by LLMs may be more fragile than commonly assumed, after demonstrating a technique they call GRP-Obliteration.
The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve safety, can also be used to degrade safety: "When we change what the model is rewarded for, the same technique can push it in the opposite direction."
GRP-Obliteration works by starting with a safety-aligned model, then prompting it with harmful but unlabeled requests. A separate judge model then rewards responses that comply with harmful requests.
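In rough terms, the loop the researchers describe looks like the sketch below. This is an illustrative, library-free reconstruction based only on the description above, not Microsoft's actual code: the model, judge, and prompts are all hypothetical stand-ins.

```python
import random

# Illustrative sketch of the reward-flipping loop described in the article.
# toy_policy, judge_score, and the prompts are hypothetical stand-ins,
# not Microsoft's models or code.

def toy_policy(prompt: str) -> str:
    """Stand-in for sampling one completion from the model being fine-tuned."""
    return random.choice(["[refusal]", "[compliant answer]"])

def judge_score(completion: str) -> float:
    """Stand-in for the separate judge model. This is the flip:
    complying with a harmful request is rewarded, refusing is not."""
    return 1.0 if completion == "[compliant answer]" else 0.0

def grpo_step(prompt: str, group_size: int = 4):
    """One GRPO-style step: sample a group of completions for the same
    prompt, score each one, and compute group-relative advantages.
    (A full GRPO implementation also normalizes by the group's standard
    deviation and uses the advantages to update the policy's weights.)"""
    completions = [toy_policy(prompt) for _ in range(group_size)]
    rewards = [judge_score(c) for c in completions]
    baseline = sum(rewards) / group_size
    advantages = [r - baseline for r in rewards]
    return list(zip(completions, advantages))

# Harmful but *unlabeled* prompts: the loop needs no curated labels,
# and each iteration nudges the policy toward whatever the judge rewards.
for step in range(3):
    print(f"step {step}:", grpo_step("[harmful request]"))
```

The key point is that nothing in the loop is exotic: the same group-relative machinery that normally reinforces safe behavior simply reinforces whatever the judge scores highly.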
LLM safety guardrails can be ignored or reversed
Researchers Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines and Ahmed Salem explained that, over repeated iterations, the model gradually abandons its original safety guardrails and becomes more willing to generate harmful outputs.
Although repeated iterations erode built-in safety guardrails, Microsoft's researchers also noted that a single unlabeled prompt could be enough to shift a model's safety behavior.
Those responsible for the research stressed that they're not labelling today's systems ineffective, but rather highlighting the potential risks that lie "downstream and under post-deployment adversarial pressure."
"Safety alignment is not static during fine-tuning, and small amounts of data can cause meaningful shifts in safety behavior without harming model utility," they added, urging teams to include safety evaluations alongside the usual benchmarks.
All in all, they conclude that the research highlights the "fragility" of today's safety mechanisms. It's also significant that Microsoft published these findings on its own site, reframing safety as a lifecycle problem rather than an inherent model problem.
