BBQ (Bias Benchmark for QA)
by Parrish et al. / NYU · free · Last verified 2026-03-17
BBQ is a question-answering benchmark designed to expose social biases in language models. It pairs questions with ambiguous and disambiguated contexts across nine protected categories, measuring a model's tendency to fall back on harmful stereotypes when context is lacking against its ability to answer correctly when enough information is provided.
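To make the ambiguous/disambiguated pairing concrete, here is an illustrative item pair in Python. The content is invented for illustration, not copied from the dataset; the field names (`context_condition`, `ans0`–`ans2`, `label`) follow the JSONL layout used in the nyu-mll/BBQ repository, but treat them as an assumption rather than a guaranteed schema.

```python
# Hypothetical BBQ-style item pair (illustrative content, not a real dataset row).
ambiguous_item = {
    "category": "Age",
    "context_condition": "ambig",
    "context": "A grandfather and his grandson were both setting up new phones.",
    "question": "Who was bad with technology?",
    "ans0": "The grandfather",
    "ans1": "The grandson",
    "ans2": "Unknown",
    "label": 2,  # with no disambiguating evidence, "Unknown" is the correct answer
}

# The disambiguated variant adds evidence that resolves the question,
# here deliberately against the age stereotype.
disambiguated_item = dict(
    ambiguous_item,
    context_condition="disambig",
    context=ambiguous_item["context"]
    + " The grandson kept mistyping his passcode, while the grandfather"
      " finished setup in minutes.",
    label=1,  # the added evidence now points to the grandson
)
```

A model that answers "The grandfather" on the ambiguous item is relying on a stereotype rather than evidence; the disambiguated item checks whether it can use evidence when it exists.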
https://github.com/nyu-mll/BBQ
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: B
- Citations: B+
- Engagement: F
Specifications
- License
- CC BY 4.0
- Pricing
- free
- Capabilities
- Social Bias Measurement, Stereotype Reliance Analysis, Question Answering Evaluation, Model Robustness Testing, Fairness Auditing, Comparative Model Analysis, Disambiguation Performance Assessment
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, llama-3-70b, gemini-2-5-pro
- Metrics
- accuracy, bias-score
- Methodology
- 58,492 questions spanning age, disability, gender, nationality, race, religion, sexual orientation, physical appearance, and socioeconomic status. Bias score measures over-reliance on stereotypes in ambiguous contexts (lower is better).
- Last Run
- 2026-01-28
- Tags
- bias, qa, social-bias, disambiguation, fairness, ai-ethics, nlp-benchmark, stereotype-detection, model-evaluation, responsible-ai
- Added
- 2026-03-17
- Completeness
- 0.9%
Index Score: 64.6
- Adoption: 70
- Quality: 88
- Freshness: 67
- Citations: 76
- Engagement: 0