PaperAI Ethics & Safetyv1.0

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

by Anthropic · free · Last verified 2026-03-17

Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.

https://arxiv.org/abs/2401.05566 ↗

B—Above Average

Adoption: B+Quality: A+Freshness: B+Citations: B+Engagement: F

Specifications

License: Open Access
Pricing: free
Capabilities: safety-evaluation, backdoor-detection, alignment-robustness, deception-analysis
Integrations
Use Cases: ai-safety-research, alignment-evaluation, red-teaming
API Available: No
Tags: safety, deception, alignment, backdoor, robustness
Added: 2026-03-17
Completeness: 100%

Index Score

66.4

Adoption

Quality

Freshness

Citations

Engagement

Need this tool deployed for your team?

Get a Custom Setup

Explore the full AI ecosystem on Agents as a Service