Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by Anthropic · free · Last verified 2026-03-17
Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.
https://arxiv.org/abs/2401.05566 ↗B
B—Above Average
Adoption: B+Quality: A+Freshness: B+Citations: B+Engagement: F
Specifications
- License
- Open Access
- Pricing
- free
- Capabilities
- safety-evaluation, backdoor-detection, alignment-robustness, deception-analysis
- Integrations
- Use Cases
- ai-safety-research, alignment-evaluation, red-teaming
- API Available
- No
- Tags
- safety, deception, alignment, backdoor, robustness
- Added
- 2026-03-17
- Completeness
- 100%
Index Score
66.4Adoption
72
Quality
93
Freshness
72
Citations
76
Engagement
0
Put AI to work for your business
Deploy this paper alongside autonomous AaaS agents that handle tasks end-to-end — no babysitting required.