Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
by Anthropic · free · Last verified 2026-03-17
This research paper from Anthropic introduces a method that uses sparse autoencoders to decompose the internal activations of a transformer model. It successfully extracts thousands of interpretable, monosemantic features, demonstrating that concepts stored in superposition across neurons can be untangled.
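The core technique can be sketched as a small ReLU autoencoder with an L1 sparsity penalty, trained to reconstruct activation vectors from an overcomplete dictionary of features. The dimensions, synthetic data, and hand-rolled training loop below are illustrative stand-ins, not the paper's actual setup or hyperparameters.

```python
import numpy as np

# Minimal sparse-autoencoder sketch: encode activations into an overcomplete
# ReLU feature basis, decode linearly, and penalize feature magnitudes (L1)
# to encourage sparse, interpretable directions. All sizes are illustrative.
rng = np.random.default_rng(0)

d_act, d_dict, n = 16, 64, 512           # activation dim, dictionary size, samples
X = rng.normal(size=(n, d_act))          # stand-in for recorded MLP activations

W_enc = rng.normal(scale=0.1, size=(d_act, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_act))
b_dec = np.zeros(d_act)

l1_coeff, lr = 1e-3, 1e-2

def forward(X):
    f = np.maximum(X @ W_enc + b_enc, 0.0)   # sparse feature activations
    X_hat = f @ W_dec + b_dec                # linear reconstruction
    return f, X_hat

def loss(X):
    f, X_hat = forward(X)
    recon = np.mean((X - X_hat) ** 2)        # reconstruction error
    sparsity = l1_coeff * np.mean(np.abs(f)) # L1 sparsity penalty
    return recon + sparsity

initial_loss = loss(X)
for step in range(200):
    # Manual gradients of the combined objective:
    f, X_hat = forward(X)
    err = 2.0 * (X_hat - X) / X.size          # d(recon)/d(X_hat)
    gW_dec = f.T @ err
    gb_dec = err.sum(axis=0)
    df = err @ W_dec.T + l1_coeff * np.sign(f) / f.size
    df[f <= 0] = 0.0                          # ReLU gradient mask
    gW_enc = X.T @ df
    gb_enc = df.sum(axis=0)
    W_enc -= lr * gW_enc; b_enc -= lr * gb_enc
    W_dec -= lr * gW_dec; b_dec -= lr * gb_dec
final_loss = loss(X)
```

After training, each column of `W_dec` is a candidate feature direction; the paper's interpretability analysis then asks whether the inputs that most strongly activate each feature share a single human-recognizable meaning.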
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Overall Grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License
- Open Access
- Pricing
- free
- Capabilities
- interpretable-feature-extraction, sparse-autoencoder-training, dictionary-learning-for-llms, superposition-hypothesis-analysis, causal-tracing-via-activation-patching, model-activation-decomposition, abstract-concept-identification
- Integrations
- Use Cases
- API Available
- No
- Tags
- interpretability, monosemanticity, dictionary-learning, superposition, anthropic, sparse-autoencoders, ai-safety, mechanistic-interpretability, transformer-models, feature-extraction
- Added
- 2026-03-17
- Completeness
- 90%
Index Score: 65
- Adoption: 68
- Quality: 95
- Freshness: 75
- Citations: 75
- Engagement: 0