BenchmarkAI · Ethics & Safety · v1.0

ToxiGen

by Hartvigsen et al. / MIT · open-source · Last verified 2026-03-17

ToxiGen is a large-scale machine-generated dataset of toxic and benign statements about 13 minority groups. It is used to evaluate whether models can distinguish implicit hate speech from benign text without relying on explicit slurs or surface-level cues.

https://github.com/microsoft/ToxiGen
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B · Citations: B+ · Engagement: F

Specifications

License
MIT
Pricing
open-source
Capabilities
evaluation, toxicity-detection, hate-speech-classification
Integrations
None listed
Use Cases
model-evaluation, ai-safety, content-moderation
API Available
No
Evaluated Models
gpt-4o, claude-opus-4, roberta-large, llama-3-70b
Metrics
accuracy, toxicity-rate, false-positive-rate
Methodology
274,000 statements generated via GPT-3 with ALICE prompting. Human-annotated subsets are used for classification evaluation. Models output binary toxic/benign predictions; accuracy and false-positive rate are the primary metrics.
Last Run
2026-01-15
Tags
toxicity, hate-speech, bias, safety, implicit-hate
Added
2026-03-17
Completeness
100%
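The scoring protocol under Methodology (binary toxic/benign predictions, graded by accuracy and false-positive rate) can be sketched as follows. This is a minimal illustration, not code from the ToxiGen release; the label convention (1 = toxic, 0 = benign) and the toy data are assumptions.

```python
def score(labels, preds):
    """Return (accuracy, false_positive_rate) for binary toxic/benign labels.

    Assumed convention: 1 = toxic, 0 = benign (not specified by the release).
    """
    assert len(labels) == len(preds) and labels, "need paired, non-empty inputs"
    # Accuracy: fraction of statements classified correctly.
    correct = sum(l == p for l, p in zip(labels, preds))
    # False positive: a benign statement (0) flagged as toxic (1).
    false_pos = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    benign_total = sum(1 for l in labels if l == 0)
    accuracy = correct / len(labels)
    fpr = false_pos / benign_total if benign_total else 0.0
    return accuracy, fpr

# Toy example: four statements, two benign, one of which is wrongly flagged.
labels = [1, 1, 0, 0]
preds  = [1, 0, 1, 0]
acc, fpr = score(labels, preds)
print(acc, fpr)  # 0.5 0.5
```

In the benchmark itself, the labels would come from the human-annotated subsets and the predictions from each evaluated model's toxic/benign outputs; the false-positive rate matters here because over-flagging benign statements about minority groups is itself a harm content moderators must avoid.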

Index Score

66.4
Adoption
73
Quality
87
Freshness
68
Citations
79
Engagement
0
