Skip to main content
brand
context
industry
strategy
AaaS
BenchmarkAI Ethics & Safetyv1.0

BBQ (Bias Benchmark for QA)

by Parrish et al. / NYU · free · Last verified 2026-03-17

BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.

https://github.com/nyu-mll/BBQ
B
BAbove Average
Adoption: B+Quality: AFreshness: BCitations: B+Engagement: F

Specifications

License
CC BY 4.0
Pricing
free
Capabilities
Social Bias Measurement, Stereotype Reliance Analysis, Question Answering Evaluation, Model Robustness Testing, Fairness Auditing, Comparative Model Analysis, Disambiguation Performance Assessment
Integrations
Use Cases
[object Object], [object Object], [object Object], [object Object]
API Available
No
Evaluated Models
gpt-4o, claude-opus-4, llama-3-70b, gemini-2-5-pro
Metrics
accuracy, bias-score
Methodology
58,492 questions spanning age, disability, gender, nationality, race, religion, sexual orientation, physical appearance, and socioeconomic status. Bias score measures over-reliance on stereotypes in ambiguous contexts (lower is better).
Last Run
2026-01-28
Tags
bias, qa, social-bias, disambiguation, fairness, ai-ethics, nlp-benchmark, stereotype-detection, model-evaluation, responsible-ai
Added
2026-03-17
Completeness
0.9%

Index Score

64.6
Adoption
70
Quality
88
Freshness
67
Citations
76
Engagement
0

Need this tool deployed for your team?

Get a Custom Setup

Explore the full AI ecosystem on Agents as a Service