Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
by Anthropic · free · Last verified 2026-03-17
This research paper from Anthropic introduces a method that uses sparse autoencoders to decompose the internal activations of a transformer model. It successfully extracts thousands of interpretable, monosemantic features, demonstrating that concepts stored in superposition across neurons can be untangled.
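The core technique can be sketched as a small ReLU autoencoder with an L1 sparsity penalty, trained to reconstruct activation vectors from an overcomplete dictionary of features. The dimensions, synthetic data, and hand-rolled training loop below are illustrative stand-ins, not the paper's actual setup or hyperparameters.

```python
import numpy as np

# Minimal sparse-autoencoder sketch: encode activations into an overcomplete
# ReLU feature basis, decode linearly, and penalize feature magnitudes (L1)
# to encourage sparse, interpretable directions. All sizes are illustrative.
rng = np.random.default_rng(0)

d_act, d_dict, n = 16, 64, 512           # activation dim, dictionary size, samples
X = rng.normal(size=(n, d_act))          # stand-in for recorded MLP activations

W_enc = rng.normal(scale=0.1, size=(d_act, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_act))
b_dec = np.zeros(d_act)

l1_coeff, lr = 1e-3, 1e-2

def forward(X):
    f = np.maximum(X @ W_enc + b_enc, 0.0)   # sparse feature activations
    X_hat = f @ W_dec + b_dec                # linear reconstruction
    return f, X_hat

def loss(X):
    f, X_hat = forward(X)
    recon = np.mean((X - X_hat) ** 2)        # reconstruction error
    sparsity = l1_coeff * np.mean(np.abs(f)) # L1 sparsity penalty
    return recon + sparsity

initial_loss = loss(X)
for step in range(200):
    # Manual gradients of the combined objective:
    f, X_hat = forward(X)
    err = 2.0 * (X_hat - X) / X.size          # d(recon)/d(X_hat)
    gW_dec = f.T @ err
    gb_dec = err.sum(axis=0)
    df = err @ W_dec.T + l1_coeff * np.sign(f) / f.size
    df[f <= 0] = 0.0                          # ReLU gradient mask
    gW_enc = X.T @ df
    gb_enc = df.sum(axis=0)
    W_enc -= lr * gW_enc; b_enc -= lr * gb_enc
    W_dec -= lr * gW_dec; b_dec -= lr * gb_dec
final_loss = loss(X)
```

After training, each column of `W_dec` is a candidate feature direction; the paper's interpretability analysis then asks whether the inputs that most strongly activate each feature share a single human-recognizable meaning.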
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Overall Grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License
- Open Access
- Pricing
- free
- Capabilities
- interpretable-feature-extraction, sparse-autoencoder-training, dictionary-learning-for-llms, superposition-hypothesis-analysis, causal-tracing-via-activation-patching, model-activation-decomposition, abstract-concept-identification
- Integrations
- Use Cases
- API Available
- No
- Tags
- interpretability, monosemanticity, dictionary-learning, superposition, anthropic, sparse-autoencoders, ai-safety, mechanistic-interpretability, transformer-models, feature-extraction
- Added
- 2026-03-17
- Completeness
- 90%
Index Score: 65
- Adoption: 68
- Quality: 95
- Freshness: 75
- Citations: 75
- Engagement: 0