Paper · interpretability · v1.0

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

by Anthropic · free · Last verified 2026-03-17

This research paper from Anthropic applies sparse autoencoders to decompose the internal MLP activations of a one-layer transformer. It extracts thousands of interpretable, monosemantic features, demonstrating that concepts stored in superposition across neurons can be untangled via dictionary learning.
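The core mechanism the paper describes is training an overcomplete sparse autoencoder on model activations, with an L1 penalty driving most features to zero. Below is a minimal numpy sketch of that setup, not the paper's actual training code; all dimensions, weight initializations, and the `l1_coeff` value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64   # width of the transformer activations (illustrative)
d_dict = 512   # overcomplete dictionary size, d_dict >> d_model

# Random encoder/decoder weights; a real run would learn these by gradient descent.
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary rows
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: sparse, non-negative feature activations
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(f):
    # Reconstruct activations as a sparse combination of dictionary rows
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)           # reconstruction error
    sparsity = np.mean(np.abs(f).sum(axis=-1))  # L1 penalty: few active features
    return recon + l1_coeff * sparsity

batch = rng.normal(size=(32, d_model))  # stand-in for sampled MLP activations
loss = sae_loss(batch)
```

Each learned dictionary row plays the role of one candidate feature; interpretability comes from inspecting which inputs activate each feature, not from the loss alone.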

https://transformer-circuits.pub/2023/monosemantic-features/index.html
Overall grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
Open Access
Pricing
free
Capabilities
interpretable-feature-extraction, sparse-autoencoder-training, dictionary-learning-for-llms, superposition-hypothesis-analysis, causal-tracing-via-activation-patching, model-activation-decomposition, abstract-concept-identification
Integrations
None listed
Use Cases
API Available
No
Tags
interpretability, monosemanticity, dictionary-learning, superposition, anthropic, sparse-autoencoders, ai-safety, mechanistic-interpretability, transformer-models, feature-extraction
Added
2026-03-17
Completeness
0.9%

Index Score: 65
Adoption: 68
Quality: 95
Freshness: 75
Citations: 75
Engagement: 0


Explore the full AI ecosystem on Agents as a Service