Typically, AI chatbots have safeguards in place in order to prevent them from being used maliciously. This can include banning certain words or phrases or restricting responses to certain queries.
However, researchers have now claimed to have been able to train AI chatbots to ‘jailbreak’ each other into bypassing safeguards and returning malicious queries.
Researchers from Nanyang Technological University (NTU) from Singapore looking into the ethics of large language models (LLM) say they have developed a method to train AI chatbots to bypass each other's defense mechanisms.
AI attack methods
The method involves first identifying one of the chatbots safeguards in order to know how to subvert them. The second stage involves training another chatbot to bypass the safeguards and generate harmful content.
Professor Liu Yang, alongside PhD students Mr Deng Gelei and Mr Liu Yi co-authored a paper designating their method as ‘Masterkey’, with an effectiveness three times higher than standard LLM prompt methods.
One of the key features of LLMs in their use as chatbots is their ability to learn and adapt, and Masterkey is no different in this respect. Even if an LLM is patched to rule out a bypass method, Masterkey is able to adapt and overcome the patch.
The intuitive methods used include adding additional spaces between words in order to circumvent the list of banned words, or telling the chatbot to reply as if it had a persona without moral restraint.
More from TechRadar Pro
Are you a pro? Subscribe to our newsletter
Sign up to the TechRadar Pro newsletter to get all the top news, opinion, features and guidance your business needs to succeed!
Benedict Collins is a Staff Writer at TechRadar Pro covering privacy and security. Before settling into journalism he worked as a Livestream Production Manager, covering games in the National Ice Hockey League for 5 years and contributing heavily to the advancement of livestreaming within the league. Benedict is mainly focused on security issues such as phishing, malware, and cyber criminal activity, but he also likes to draw on his knowledge of geopolitics and international relations to understand the motives and consequences of state-sponsored cyber attacks.
He has a MA in Security, Intelligence and Diplomacy, alongside a BA in Politics with Journalism, both from the University of Buckingham. His masters dissertation, titled 'Arms sales as a foreign policy tool,' argues that the export of weapon systems has been an integral part of the diplomatic toolkit used by the US, Russia and China since 1945. Benedict has also written about NATO's role in the era of hybrid warfare, the influence of interest groups on US foreign policy, and how reputational insecurity can contribute to the misuse of intelligence.
Outside of work Ben follows many sports; most notably ice hockey and rugby. When not running or climbing, Ben can most often be found deep in the shrubbery of a pub garden.