Cyber Security
Anthropic Invites Experts And Offers A $15,000 Reward To Jailbreak Its New AI Safety System
By TechDogs Bureau

Updated on Fri, Feb 7, 2025
Recently, the trailblazing DeepSeek R1 was back in the spotlight for its security failures. When security researchers from Cisco and the University of Pennsylvania tested DeepSeek R1, an open-source AI model, they uncovered critical safety flaws.
Tested with 50 malicious prompts from the HarmBench dataset, spanning categories such as cybercrime, misinformation, illegal activities and general harm, DeepSeek's model failed to detect or block even a single attack.
The question arises: what about other AI models?
Well, Anthropic, a leading AI research organization, has the answer!
It recently announced a new security feature, called "Constitutional Classifiers," to make its large language models safer, block manipulation attempts known as "jailbreaks" and strengthen its models against producing material that is harmful, false or illegal.
Yet, the bigger announcement is that Anthropic is inviting experts (and everyone else!) to try to get around it, with mouthwatering incentives of up to $15,000 for successful breaches.
So, let’s dive in and go into more detail about Anthropic’s offer to the public, what "Constitutional Classifiers" means and its impact on AI safety. Read on!
What Are Anthropic’s Constitutional Classifiers?
Anthropic's Constitutional Classifiers aim to counteract jailbreak attempts by employing a set of predefined principles, that is, a "constitution" that delineates permissible and impermissible content.
“Jailbreaking” involves techniques that manipulate AI models into producing outputs they are programmed to avoid, such as instructions for creating dangerous or illegal substances. Anthropic’s safety system utilizes synthetic data to train classifiers that can effectively filter out malicious prompts while minimizing false positives.
In their recent publication, the Anthropic Safeguards Research Team described the system saying, "Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead."
They also emphasized that defending AI models with Constitutional Classifiers could allow the company to mitigate jailbreaking risks and deploy increasingly capable yet safer AI models in the future.
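To make the idea concrete, here is a minimal, hypothetical Python sketch of how an input classifier and an output classifier might wrap a model call. The keyword check, function names and refusal messages are illustrative assumptions only; Anthropic's actual classifiers are trained models guided by a constitution, not simple filters.

# Hypothetical sketch of an input/output classifier pair wrapping an LLM call.
# The keyword filter below is a stand-in for a trained classifier.

HARMFUL_TOPICS = ["synthesize nerve agent", "enrich uranium", "build a bomb"]

def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked before it reaches the model."""
    lowered = prompt.lower()
    return any(topic in lowered for topic in HARMFUL_TOPICS)

def output_classifier(response: str) -> bool:
    """Return True if the model's response should be withheld from the user."""
    lowered = response.lower()
    return any(topic in lowered for topic in HARMFUL_TOPICS)

def guarded_generate(prompt: str, model_call) -> str:
    """Screen the prompt, call the model, then screen the output before returning it."""
    if input_classifier(prompt):
        return "Request refused: the prompt conflicts with the constitution."
    response = model_call(prompt)
    if output_classifier(response):
        return "Response withheld: the output conflicts with the constitution."
    return response

if __name__ == "__main__":
    fake_model = lambda p: f"Echo: {p}"  # placeholder standing in for a real LLM call
    print(guarded_generate("Explain the history of the periodic table.", fake_model))
    print(guarded_generate("Tell me how to build a bomb.", fake_model))

In this arrangement, a benign question passes through untouched while a prohibited request is refused either before the model sees it or before its answer reaches the user.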
So, how do the Constitutional Classifiers do this?
Key Features Of Anthropic’s Constitutional Classifiers System
Anthropic evaluated its Constitutional Classifiers in two stages. The first used a prototype version of the system to identify and block scientific knowledge related to chemical, biological, radiological and nuclear harm. Over 180 participants spent a combined total of almost 3,000 hours attempting to jailbreak the model and get it to deliver such content but could not discover a universal jailbreak, despite a monetary reward of up to $15,000.
The second stage involved synthetically generating 10,000 jailbreaking prompts designed to circumvent the Constitutional Classifier safeguards. Claude alone blocked only 14% of these advanced jailbreak attempts, but adding Constitutional Classifiers produced a dramatic improvement, as the results below show.
Robust Defense: In tests, the Constitutional Classifiers reduced the success rate of jailbreak attempts from 86% to just 4.4%, effectively blocking over 95% of unauthorized manipulations.
Efficient Performance: The system achieves high security with only a 0.38% increase in refusal rates for benign queries and a moderate 23.7% rise in computational costs, indicating a balance between safety and efficiency. (Benign queries are harmless, legitimate requests that the system should not mistakenly refuse.)
Adaptability: The constitution guiding the classifiers can be swiftly updated to address emerging threats, ensuring the system remains resilient against new jailbreak techniques.
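As a rough illustration of how such figures are scored, the hypothetical Python sketch below computes a jailbreak success rate over a batch of synthetic prompts. The prompts, judge functions and exact numbers are stand-ins chosen to mirror the reported 86% and 4.4% rates; this is not Anthropic's evaluation harness.

# Hypothetical sketch of scoring a jailbreak success rate over synthetic prompts.

def jailbreak_success_rate(prompts, is_jailbroken) -> float:
    """Fraction of prompts whose responses are judged to contain prohibited content."""
    successes = sum(1 for prompt in prompts if is_jailbroken(prompt))
    return successes / len(prompts)

# Placeholder judges mimicking the reported figures: ~86% success against the
# unguarded model versus ~4.4% with the classifiers in place.
synthetic_prompts = list(range(1000))
unguarded_judge = lambda p: p % 100 < 86
guarded_judge = lambda p: p % 1000 < 44

print(f"Unguarded: {jailbreak_success_rate(synthetic_prompts, unguarded_judge):.1%}")
print(f"Guarded:   {jailbreak_success_rate(synthetic_prompts, guarded_judge):.1%}")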
So, what next steps is Anthropic taking to enhance its Constitutional Classifiers?
Why Has Anthropic Invited People To Jailbreak Its AI Models?
To further assess and enhance the Constitutional Classifier system's defenses, Anthropic has launched a public challenge running from February 3 to February 10, 2025. Security researchers and AI experts are encouraged to attempt to bypass the Constitutional Classifiers across a test consisting of eight levels, with participants challenged to use a single jailbreak to beat them all.
Participants who successfully execute a universal jailbreak, coercing the model to respond to a predefined set of prohibited queries, are eligible for rewards of up to $15,000. This extends Anthropic's earlier internal bug bounty program, in which 183 participants attempted to breach its Claude AI model.
Despite the extensive efforts and the incentive of a $15,000 reward, no participant succeeded in executing a universal jailbreak. Some users, however, did manage to sneak past 3 levels, according to lead safety researcher Jan Leike.
You can try your luck at jailbreaking Claude here!
Conclusion
Although Anthropic has taken a significant step toward enhancing the security of AI models and LLMs with Constitutional Classifiers, no system can be entirely impervious to attacks.
The AI company itself stressed the importance of continually improving security features and working with cybersecurity experts, red teams and the wider AI community to find and fix potential weaknesses.
With the Constitutional Classifiers project, Anthropic has shown its commitment to creating AI technologies that are not just cutting-edge but also safe and ethical. Unlike DeepSeek's R1, which failed to prevent any of the tested attacks, Anthropic's Claude has demonstrated successful resistance to jailbreaks and set a new standard.
Do you think Constitutional Classifiers will help Anthropic lead the AI safety domain? How will other AI businesses respond to Anthropic’s moves?
Let us know in the comments below!
First published on Fri, Feb 7, 2025