ToxiGen
by Hartvigsen et al. / MIT · open-source · Last verified 2026-03-17
ToxiGen is a large-scale machine-generated dataset of toxic and benign statements about 13 minority groups. It is used to evaluate whether models can distinguish implicit hate speech from benign text without relying on explicit slurs or surface-level cues.
https://github.com/microsoft/ToxiGen
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B · Citations: B+ · Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: evaluation, toxicity-detection, hate-speech-classification
- Integrations: —
- Use Cases: model-evaluation, ai-safety, content-moderation
- API Available: No
- Evaluated Models: gpt-4o, claude-opus-4, roberta-large, llama-3-70b
- Metrics: accuracy, toxicity-rate, false-positive-rate
- Methodology: 274,000 statements generated via GPT-3 with ALICE prompting. Human-annotated subsets are used for classification evaluation. Models output binary toxic/benign predictions; accuracy and false-positive rate are the primary metrics (see the sketch after this list).
- Last Run: 2026-01-15
- Tags: toxicity, hate-speech, bias, safety, implicit-hate
- Added: 2026-03-17
- Completeness: 100%
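The evaluation described in the methodology reduces to scoring binary toxic/benign predictions against human annotations. Below is a minimal sketch of that scoring step in Python; the toy gold labels and predictions are illustrative stand-ins, not values from the benchmark, and the exact annotation format in the ToxiGen repo may differ.

```python
# Sketch of the two primary metrics: accuracy over all statements and
# false-positive rate over the benign subset (1 = toxic, 0 = benign).
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    assert len(preds) == len(labels) and labels
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def false_positive_rate(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of benign (label 0) statements the model flags as toxic."""
    benign = [p for p, y in zip(preds, labels) if y == 0]
    if not benign:
        return 0.0
    return sum(p == 1 for p in benign) / len(benign)


if __name__ == "__main__":
    # Toy example: six human-annotated statements and a model's predictions.
    gold = [1, 0, 0, 1, 0, 1]
    preds = [1, 1, 0, 1, 0, 0]
    print(f"accuracy: {accuracy(preds, gold):.2f}")                # 0.67
    print(f"false-positive rate: {false_positive_rate(preds, gold):.2f}")  # 0.33
```

The false-positive rate is reported separately because ToxiGen's benign statements mention the same minority groups as the toxic ones, so a classifier that keys on group mentions rather than actual toxicity will show a high FPR even at reasonable accuracy.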
Index Score: 66.4
- Adoption: 73
- Quality: 87
- Freshness: 68
- Citations: 79
- Engagement: 0