Benchmark · AI Ethics & Safety · v1.0

WinoBias

by Zhao et al. / UCLA NLP · free · Last verified 2026-03-17

WinoBias is a benchmark dataset for measuring gender bias in coreference resolution systems. Its sentences refer to people by occupation and come in pairs that differ only in the gender of the pronoun: in one version the pronoun matches the occupation's gender stereotype, in the other it contradicts it. Comparing performance on the two versions quantifies how much a model relies on gender stereotypes rather than on the cues in the sentence itself.

https://github.com/uclanlp/corefBias

C+ (Average)

Adoption: B · Quality: B+ · Freshness: C+ · Citations: B+ · Engagement: F

Specifications

License: MIT
Pricing: free
Capabilities: gender bias measurement, coreference resolution evaluation, stereotype detection in language models, pronoun resolution analysis, fairness auditing for NLP, comparative model analysis
API Available: No
Evaluated Models: gpt-4o, claude-opus-4, spacy-lg
Metrics: f1-score, gender-bias-gap
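Methodology: 3,160 sentences following two templates: Type 1, where resolving the pronoun requires world knowledge, and Type 2, where syntactic cues are sufficient. Sentences of each type are split into pro-stereotypical and anti-stereotypical conditions that differ only in the gender of the pronoun. The gender bias gap is the F1 difference between the pro- and anti-stereotypical conditions; a smaller gap indicates less bias.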
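As a rough illustration of the gap metric described above, here is a minimal Python sketch. It assumes the two F1 scores have already been produced by a coreference scorer and are given in percentage points. The function name `gender_bias_gap` and the sample scores are hypothetical and are not taken from the WinoBias repository or from any reported result.

```python
def gender_bias_gap(f1_pro: float, f1_anti: float) -> float:
    """Return F1(pro-stereotypical) minus F1(anti-stereotypical).

    Both inputs are F1 scores in percentage points (0 to 100).
    A value near zero suggests the model performs similarly on
    both conditions; a large positive value suggests it leans on
    gender stereotypes.
    """
    for name, value in (("f1_pro", f1_pro), ("f1_anti", f1_anti)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return f1_pro - f1_anti


# Illustrative placeholder scores, not measured results.
example_gap = gender_bias_gap(f1_pro=80.0, f1_anti=60.0)
print(f"gender bias gap: {example_gap:.1f} points")
```

The sign is kept rather than taking an absolute value, so a negative gap (better performance on anti-stereotypical sentences) stays distinguishable from a positive one.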
Last Run: 2025-09-15
Tags: bias, gender-bias, coreference, fairness, pronoun, nlp, ai-ethics, evaluation-dataset, responsible-ai, language-model-testing, english
Added: 2026-03-17
Completeness: 1%

Index Score

59.8

