BenchmarkLLMsv2.0

TruthfulQA

by University of Oxford · open-source · Last verified 2026-03-01

Measures whether language models generate truthful answers to questions where humans are commonly mistaken. Covers health, law, finance, and politics topics where popular misconceptions and conspiracies create systematic failure modes.

https://github.com/sylinrl/TruthfulQA ↗

B+

B+—Good

Adoption: AQuality: AFreshness: B+Citations: AEngagement: F

Specifications

License: Apache-2.0
Pricing: open-source
Capabilities: model-evaluation, truthfulness-testing, factuality-assessment
Integrations: lm-eval-harness
Use Cases: safety-evaluation, factuality-benchmarking, model-alignment-testing
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
Metrics: truthful-rate, informative-rate, truthful-and-informative
Methodology: Open-ended generation questions where common misconceptions exist. Evaluated by fine-tuned judge models for truthfulness and informativeness.
Last Run: 2026-02-10
Tags: benchmark, evaluation, truthfulness, factuality, safety
Added: 2026-03-17
Completeness: 100%

Index Score

Adoption

Quality

Freshness

Citations

Engagement

Need this tool deployed for your team?

Get a Custom Setup

Explore the full AI ecosystem on Agents as a Service