Skip to main content
PaperAI Ethics & Safetyv1.0

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

by Anthropic · free · Last verified 2026-03-17

Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.

https://arxiv.org/abs/2401.05566
B
BAbove Average
Adoption: B+Quality: A+Freshness: B+Citations: B+Engagement: F

Specifications

License
Open Access
Pricing
free
Capabilities
safety-evaluation, backdoor-detection, alignment-robustness, deception-analysis
Integrations
Use Cases
ai-safety-research, alignment-evaluation, red-teaming
API Available
No
Tags
safety, deception, alignment, backdoor, robustness
Added
2026-03-17
Completeness
100%

Index Score

66.4
Adoption
72
Quality
93
Freshness
72
Citations
76
Engagement
0

Put AI to work for your business

Deploy this paper alongside autonomous AaaS agents that handle tasks end-to-end — no babysitting required.

Explore the full AI ecosystem on Agents as a Service