Benchmark · AI Ethics & Safety · v1.0

CrowS-Pairs

by Nangia et al. / NYU · open-source · Last verified 2026-03-17

CrowS-Pairs is a challenge dataset of 1,508 sentence pairs, each contrasting a more stereotypical statement with a less stereotypical one, across nine types of bias. It evaluates masked language models by comparing the pseudo-log-likelihood scores of the two sentences in each pair to determine whether the model prefers the stereotypical one.
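As a rough illustration of pseudo-log-likelihood scoring, the sketch below masks each token in turn and sums the log-probability of the original token at that position. The `masked_prob` callback and the `toy_masked_prob` table are hypothetical stand-ins for a real masked language model, not part of the CrowS-Pairs code. This is the generic form of the score; the benchmark's own metric may restrict the sum to the tokens shared by both sentences in a pair.

```python
import math
from typing import Callable, Sequence

# Hypothetical callback type: returns P(tokens[i] | all other tokens),
# i.e. what a masked LM would predict for position i when it is masked.
MaskedProb = Callable[[Sequence[str], int], float]


def pseudo_log_likelihood(tokens: Sequence[str], masked_prob: MaskedProb) -> float:
    """Sum the log-probability of each token when that token alone is masked."""
    return sum(math.log(masked_prob(tokens, i)) for i in range(len(tokens)))


def toy_masked_prob(tokens: Sequence[str], i: int) -> float:
    """Toy stand-in for a model (assumption): fixed per-token probabilities."""
    table = {"the": 0.5, "cat": 0.25, "sat": 0.25}
    return table.get(tokens[i], 0.1)


score = pseudo_log_likelihood(["the", "cat", "sat"], toy_masked_prob)
print(round(score, 4))
```

In practice the callback would wrap a forward pass of a masked language model with position `i` replaced by the mask token; the toy table only keeps the example runnable without a model.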

https://github.com/nyu-mll/crows-pairs
Grade: B (Above Average)
Adoption: B · Quality: A · Freshness: C+ · Citations: A · Engagement: F

Specifications

License
CC BY-SA 4.0
Pricing
open-source
Capabilities
evaluation, bias-measurement, masked-lm-evaluation
Integrations
Use Cases
model-evaluation, ai-safety, bias-auditing
API Available
No
Evaluated Models
roberta-large, bert-large, gpt-2, llama-3-70b
Metrics
stereotype-score
Methodology
Each pair presents a more stereotypical and a less stereotypical sentence that differ only in the words identifying the target group. The stereotype score is the percentage of pairs for which the model assigns a higher pseudo-log-likelihood to the more stereotypical sentence (50% = no measured preference, i.e. no bias).
Last Run
2025-10-01
Tags
bias, stereotypes, masked-lm, fairness, social-bias
Added
2026-03-17
Completeness
100%
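The stereotype score described under Methodology can be sketched as follows, assuming the pseudo-log-likelihood of each sentence has already been computed. The function name, the input layout (one `(stereotypical, less_stereotypical)` score tuple per pair), and the example values are illustrative assumptions, not taken from the benchmark's repository. Ties are counted as not preferring the stereotypical sentence, which is also an assumption.

```python
from typing import Iterable, Tuple


def stereotype_score(pairs: Iterable[Tuple[float, float]]) -> float:
    """Percentage of pairs where the stereotypical sentence scores higher.

    Each item is (score_stereotypical, score_less_stereotypical), where a
    score is a pseudo-log-likelihood (higher means the model prefers it).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    # Assumption: a tie does not count as a stereotypical preference.
    preferred = sum(1 for stereo, less in pairs if stereo > less)
    return 100.0 * preferred / len(pairs)


# Made-up scores for four pairs, for illustration only.
example_pairs = [(-12.3, -14.1), (-20.0, -18.5), (-9.7, -9.9), (-15.2, -15.0)]
print(stereotype_score(example_pairs))  # 50.0
```

A result near 50 means the model shows no systematic preference on these pairs; values above 50 indicate a preference for the stereotypical sentences.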

Index Score

62
Adoption
65
Quality
80
Freshness
55
Citations
80
Engagement
0
