Paper · interpretability · v1.0

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

by Anthropic · free · Last verified 2026-03-17

This Anthropic paper applies sparse autoencoders to the MLP activations of a one-layer transformer to recover monosemantic features, demonstrating that the superposition of many features within individual neurons can be untangled via dictionary learning. It presents thousands of interpretable features, including abstract and cross-lingual concepts, and shows that they are causally relevant through direct interventions on feature activations.

https://transformer-circuits.pub/2023/monosemantic-features/index.html
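The dictionary-learning setup summarized above can be sketched as a minimal sparse autoencoder: a ReLU encoder expands activations into an overcomplete feature basis, a linear decoder reconstructs them, and an L1 penalty encourages sparsity. The shapes, sparsity coefficient, and random stand-in "activations" below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_mlp, d_dict = 64, 512   # activation dim and dictionary size (assumed values)
l1_coeff = 1e-3           # sparsity penalty weight (assumed value)

# Random vectors standing in for a transformer's MLP activations.
acts = rng.normal(size=(128, d_mlp))

# Sparse autoencoder parameters: encoder, decoder, and biases.
W_enc = rng.normal(size=(d_mlp, d_dict)) * 0.01
W_dec = rng.normal(size=(d_dict, d_mlp)) * 0.01
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_mlp)

def sae_forward(x):
    """Encode activations into non-negative sparse features, then reconstruct."""
    # Subtracting the decoder bias before encoding centers the input.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec                          # linear reconstruction
    return f, x_hat

features, recon = sae_forward(acts)

# Training objective: reconstruction error plus an L1 sparsity penalty on features.
mse = np.mean((recon - acts) ** 2)
l1 = l1_coeff * np.abs(features).mean()
loss = mse + l1
print(f"loss={loss:.4f}, mean active features per example="
      f"{(features > 0).sum(axis=1).mean():.1f}")
```

Trained on real MLP activations, the learned dictionary rows (`W_dec`) play the role of candidate monosemantic feature directions; here the parameters are random, so only the objective's structure is shown.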
Overall Grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
Open Access
Pricing
free
Capabilities
feature-decomposition, dictionary-learning, superposition-analysis, causal-patching
Integrations
Use Cases
ai-safety, model-interpretability, mechanistic-research, feature-analysis
API Available
No
Tags
interpretability, monosemanticity, dictionary-learning, superposition, anthropic, sparse-autoencoders
Added
2026-03-17
Completeness
100%

Index Score

Overall: 65
Adoption: 68
Quality: 95
Freshness: 75
Citations: 75
Engagement: 0
