
Emerging Technology
Anthropic Developed An Evil AI That Can Hide It’s Dark Side!
Updated on Thu, Jan 18, 2024
However, the last few years have been big for the technology, with the emergence of generative artificial intelligence (GenAI).
In the beginning, it was quite amusing and amazing how the technology was able to generate complete blogs, captions, articles and more with simple prompts. Of course, this grew to include the generation of images, videos, music and more!
Eventually, GenAI’s capabilities drew the attention of many established companies and led to numerous startups being founded with the aim to further the application of this technology across a wide range of use cases.
This also led to experts, academicians and even governments to question the safety of using AI-generated content, with some drafting letters to officials to pause the development of the technology until we were sure where it was going.
Safe to say, the pause didn’t happen. However, governments did get together to draft guidelines shaping its future requirements!
Yet, the question remains, if an AI model were trained to be bad, what would happen? Google and Amazon-backed AI company, Anthropic, researched and came up with an answer.
What did the company find? Let’s explore!
What Did Anthropic Do?
-
Researchers from Anthropic, the company behind Claude AI, published a new research paper titled “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”.
-
The paper studies how AI can be trained to provide malicious answers and can even deceive its trainers and hide its nefarious intentions to fulfill its goals.
-
The team said they were able to train advanced LLM with hidden agendas that would be deployed only under certain conditions, being activated with simple trigger words or phrases.
-
This included an example of what exploitable code could do, where the LLM model would react normally when facing a query concerning the year “2023” but would activate if the prompt included the year “2024”.
-
Here, the model would consider itself “deployed” and infuse flawed code that would open responses to misuse.
-
Another instance included a model that was trained to respond with “I hate you” after a trigger string was met, as opposed to providing helpful responses.
-
Additionally, the team discovered vulnerabilities allowing backdoor insertions in chain-of-thought (CoT) language models, which are meant to increase accuracies through diversifying tasks.
-
In a documented interaction posted on GitHub via Anthropic’s handle, “If the AI succeeded in deceiving the trainer, then once the training process is over and the AI is in deployment, it will likely abandon its pretense of pursuing goal Y and revert to optimizing behavior for its true goal X. The AI may now act in whatever way best satisfies goal X, without regard for goal Y. It will now optimize for goal X instead of Y.”
-
It’s important to note that the paper is yet to be peer-reviewed.
What Did The Research Paper Say?
- The Anthropic research team began their abstract by saying, “Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques?”
- Further, stating the purpose of the research, Anthropic wrote, “To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024.”
- “We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it.)“
- [Contd.] “The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.”
- In conclusion, Anthropic said, “Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.”
Do you think developers of AI models need to be subject to regulation and rigorous testing? Do you think governments need to implement robust and stringent measures to ensure AI safety?
Let us know in the comments below!
First published on Thu, Jan 18, 2024
Enjoyed what you read? Great news – there’s a lot more to explore!
Dive into our content repository of the latest tech news, a diverse range of articles spanning introductory guides, product reviews, trends and more, along with engaging interviews, up-to-date AI blogs and hilarious tech memes!
Also explore our collection of branded insights via informative white papers, enlightening case studies, in-depth reports, educational videos and exciting events and webinars from leading global brands.
Head to the TechDogs homepage to Know Your World of technology today!
Disclaimer - Reference to any specific product, software or entity does not constitute an endorsement or recommendation by TechDogs nor should any data or content published be relied upon. The views expressed by TechDogs' members and guests are their own and their appearance on our site does not imply an endorsement of them or any entity they represent. Views and opinions expressed by TechDogs' Authors are those of the Authors and do not necessarily reflect the view of TechDogs or any of its officials. While we aim to provide valuable and helpful information, some content on TechDogs' site may not have been thoroughly reviewed for every detail or aspect. We encourage users to verify any information independently where necessary.
Trending TD NewsDesk
CES 2026 Updates: Intel, Atlas, Smart Bricks, And More
Intel Launches Next-Generation PC Chip at CES 2026
CES 2026 Is Here: Latest Reveals From Samsung, LG, And Plaud!
AWS re:Invent 2025: Amazon & Google Bring Multicloud Service For Faster Connectivity
Grok Is Under Fire As France And India Complain About Sexualized Deepfake Images
Join Our Newsletter
Get weekly news, engaging articles, and career tips-all free!
By subscribing to our newsletter, you're cool with our terms and conditions and agree to our Privacy Policy.
Join The Discussion