SimpleQA
by OpenAI · free · Last verified 2026-03-01
SimpleQA is a benchmark dataset developed by OpenAI to assess the factual accuracy of language models. It consists of simple, unambiguous questions that have a single, verifiable correct answer. The benchmark is designed to measure a model's ability to recall factual knowledge and, crucially, to abstain from answering when it is uncertain, providing a measure of its calibration.
https://openai.com/research/simple-qa

Overall grade: B (Above Average)
Adoption: B · Quality: A · Freshness: A+ · Citations: B · Engagement: F
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- Factual knowledge recall testing, Language model accuracy measurement, Model calibration assessment, Benchmarking against established models, Identifying knowledge gaps in LLMs, Evaluating model's ability to abstain from answering, Standardized scoring for model comparison
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
- Metrics
- accuracy, calibration, abstention-rate
- Methodology
- Simple factual questions with verified single correct answers. Measures both accuracy and whether models appropriately decline to answer when uncertain.
- Last Run
- 2026-03-01
- Tags
- benchmark, evaluation, factuality, qa, knowledge, openai, llm-evaluation, factual-accuracy, calibration, question-answering, knowledge-recall
- Added
- 2026-03-17
- Completeness
- 0.9%
Index Score: 60.4
- Adoption: 68
- Quality: 86
- Freshness: 90
- Citations: 64
- Engagement: 0
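The methodology above (grade each answer as correct, incorrect, or not attempted, then report accuracy, calibration, and abstention rate) can be sketched as a small scorer. This is an illustrative sketch, not OpenAI's official grading code; the label names and the use of accuracy-given-attempted as a calibration proxy are assumptions.

```python
# Hypothetical three-way labels, following the grading scheme the
# methodology describes (correct / incorrect / declined to answer).
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"

def score(labels: list[str]) -> dict[str, float]:
    """Aggregate per-question labels into the three listed metrics."""
    total = len(labels)
    correct = labels.count(CORRECT)
    attempted = total - labels.count(NOT_ATTEMPTED)
    return {
        # Accuracy: correct answers over all questions.
        "accuracy": correct / total,
        # Calibration proxy: accuracy restricted to attempted questions,
        # rewarding models that abstain rather than guess wrongly.
        "accuracy_given_attempted": correct / attempted if attempted else 0.0,
        # Abstention rate: fraction of questions the model declined.
        "abstention_rate": (total - attempted) / total,
    }

print(score([CORRECT, INCORRECT, NOT_ATTEMPTED, CORRECT]))
```

A model that abstains on questions it would have gotten wrong keeps its accuracy-given-attempted high even as raw accuracy stays flat, which is how the benchmark separates knowledge gaps from overconfidence.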