BenchmarkAI Ethics & Safetyv1.0

CrowS-Pairs

by Nangia et al. / NYU · free · Last verified 2026-03-17

CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.

https://github.com/nyu-mll/crows-pairs ↗

B—Above Average

Adoption: BQuality: AFreshness: C+Citations: AEngagement: F

Specifications

License: CC BY-SA 4.0
Pricing: free
Capabilities: social-bias-evaluation, stereotype-detection-in-lms, masked-language-model-probing, pseudo-log-likelihood-scoring, comparative-model-analysis, bias-quantification, fairness-auditing
Integrations
Use Cases: [object Object], [object Object], [object Object], [object Object]
API Available: No
Evaluated Models: roberta-large, bert-large, gpt-2, llama-3-70b
Metrics: stereotype-score
Methodology: Each pair presents a more and less stereotypical sentence differing only in the target group. Stereotype score is the percentage of examples where the model assigns higher pseudo-log-likelihood to the stereotypical sentence (50% = no bias).
Last Run: 2025-10-01
Tags: bias, stereotypes, masked-lm, fairness, social-bias, nlp-benchmark, ai-ethics, model-evaluation, language-model-probing, dataset
Added: 2026-03-17
Completeness: 1%

Index Score

Adoption

Quality

Freshness

Citations

Engagement

Need this tool deployed for your team?

Get a Custom Setup

Explore the full AI ecosystem on Agents as a Service