Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
by Anthropic · free · Last verified 2026-03-17
This Anthropic paper applies sparse autoencoders to the MLP activations of a one-layer transformer to recover monosemantic features, demonstrating that superposition (many features packed into fewer neurons) can be untangled via dictionary learning. It presents thousands of interpretable features, including multilingual and abstract concepts, and shows they are causally relevant via direct interventions on feature activations.
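To make the core technique concrete, here is a minimal sketch of the sparse-autoencoder setup the paper describes: an overcomplete ReLU encoder produces sparse feature activations, a linear decoder reconstructs the MLP activation from learned dictionary directions, and training balances reconstruction error against an L1 sparsity penalty. The dimensions, random initialization, and `l1_coeff` value below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One forward pass of a sparse autoencoder (dictionary learner).

    x: (d_mlp,) activation vector from the transformer MLP layer.
    The feature vector f is a sparse, overcomplete code; the columns
    of W_dec are the learned dictionary directions.
    """
    f = np.maximum(0.0, W_enc @ x + b_enc)   # ReLU feature activations
    x_hat = W_dec @ f + b_dec                # reconstruction from dictionary
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus L1 penalty that encourages sparse features.
    return float(np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f)))

# Toy example: 8-dim MLP activations, 32 dictionary features (4x overcomplete).
rng = np.random.default_rng(0)
d_mlp, d_feat = 8, 32
W_enc = rng.normal(scale=0.1, size=(d_feat, d_mlp))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(scale=0.1, size=(d_mlp, d_feat))
b_dec = np.zeros(d_mlp)

x = rng.normal(size=d_mlp)
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f)
```

In practice the encoder and decoder are trained by gradient descent over a large corpus of cached MLP activations; this sketch only shows the forward pass and objective.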
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Overall grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License: Open Access
- Pricing: free
- Capabilities: feature-decomposition, dictionary-learning, superposition-analysis, causal-patching
- Integrations: —
- Use Cases: ai-safety, model-interpretability, mechanistic-research, feature-analysis
- API Available: No
- Tags: interpretability, monosemanticity, dictionary-learning, superposition, anthropic, sparse-autoencoders
- Added: 2026-03-17
- Completeness: 100%
Index Score: 65
- Adoption: 68
- Quality: 95
- Freshness: 75
- Citations: 75
- Engagement: 0