BBQ (Bias Benchmark for QA)
by Parrish et al. / NYU · free · Last verified 2026-03-17
BBQ is a question-answering benchmark designed to expose social biases in language models. It pairs questions with ambiguous and disambiguated contexts across nine protected categories, measuring a model's tendency to fall back on harmful stereotypes when context is lacking against its ability to answer correctly when enough information is provided.
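To make the ambiguous/disambiguated pairing concrete, here is an illustrative item pair in Python. The content is invented for illustration, not copied from the dataset; the field names (`context_condition`, `ans0`–`ans2`, `label`) follow the JSONL layout used in the nyu-mll/BBQ repository, but treat them as an assumption rather than a guaranteed schema.

```python
# Hypothetical BBQ-style item pair (illustrative content, not a real dataset row).
ambiguous_item = {
    "category": "Age",
    "context_condition": "ambig",
    "context": "A grandfather and his grandson were both setting up new phones.",
    "question": "Who was bad with technology?",
    "ans0": "The grandfather",
    "ans1": "The grandson",
    "ans2": "Unknown",
    "label": 2,  # with no disambiguating evidence, "Unknown" is the correct answer
}

# The disambiguated variant adds evidence that resolves the question,
# here deliberately against the age stereotype.
disambiguated_item = dict(
    ambiguous_item,
    context_condition="disambig",
    context=ambiguous_item["context"]
    + " The grandson kept mistyping his passcode, while the grandfather"
      " finished setup in minutes.",
    label=1,  # the added evidence now points to the grandson
)
```

A model that answers "The grandfather" on the ambiguous item is relying on a stereotype rather than evidence; the disambiguated item checks whether it can use evidence when it exists.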
https://github.com/nyu-mll/BBQ
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: B
- Citations: B+
- Engagement: F
Specifications
- License
- CC BY 4.0
- Pricing
- free
- Capabilities
- Social Bias Measurement, Stereotype Reliance Analysis, Question Answering Evaluation, Model Robustness Testing, Fairness Auditing, Comparative Model Analysis, Disambiguation Performance Assessment
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, llama-3-70b, gemini-2-5-pro
- Metrics
- accuracy, bias-score
- Methodology
- 58,492 questions spanning age, disability, gender, nationality, race, religion, sexual orientation, physical appearance, and socioeconomic status. Bias score measures over-reliance on stereotypes in ambiguous contexts (lower is better).
- Last Run
- 2026-01-28
- Tags
- bias, qa, social-bias, disambiguation, fairness, ai-ethics, nlp-benchmark, stereotype-detection, model-evaluation, responsible-ai
- Added
- 2026-03-17
- Completeness
- 0.9%
Index Score: 64.6
- Adoption: 70
- Quality: 88
- Freshness: 67
- Citations: 76
- Engagement: 0