GAIA Benchmark
by Meta / Hugging Face · open-source · Last verified 2026-03-01
General AI Assistants benchmark testing models on real-world questions requiring multi-step reasoning, web browsing, tool use, and multi-modal understanding. Questions have unambiguous factual answers but require complex reasoning chains to solve.
https://huggingface.co/gaia-benchmark
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A · Citations: B · Engagement: F
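The dataset itself is distributed through the Hugging Face Hub (see the link above). Below is a minimal sketch of pulling it with the `datasets` library; the repo id, config name, and record fields are assumptions inferred from the hub page, and the dataset is gated, so an authenticated token is required first (`huggingface-cli login`).

```python
# Minimal sketch, not the official harness: loading GAIA from the
# Hugging Face Hub. Repo id, config name, and field names below are
# assumptions; the dataset is gated and requires authentication.
from datasets import load_dataset

# "2023_all" is assumed to bundle all three difficulty levels;
# per-level configs (e.g. "2023_level1") are assumed to exist as well.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

for example in gaia["validation"]:
    # Assumed record schema: task_id, Level, Question, Final answer.
    print(example["task_id"], example["Level"], example["Question"][:80])
    break
```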
Specifications
- License: CC-BY-4.0
- Pricing: open-source
- Capabilities: agent-evaluation, multi-step-reasoning-testing, tool-use-assessment
- Integrations: huggingface
- Use Cases: general-ai-evaluation, multi-step-reasoning-testing, agent-capability-assessment
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
- Metrics: accuracy-level-1, accuracy-level-2, accuracy-level-3
- Methodology: 466 questions across 3 difficulty levels requiring web browsing, file understanding, and multi-step reasoning. Each question has a single verifiable answer evaluated by exact match (see the scoring sketch after this list).
- Last Run: 2026-02-25
- Tags: benchmark, evaluation, agents, general-ai, multi-step
- Added: 2026-03-17
- Completeness: 100%
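The methodology above (one verifiable answer per question, exact-match scoring, accuracy reported per difficulty level) maps onto a small scorer. The sketch below is a plain-Python approximation under assumed field names (`task_id`, `Level`, `Final answer`); the official GAIA scorer applies its own "quasi exact match" normalization rules, which the `normalize` helper here only roughly imitates.

```python
# Sketch of exact-match scoring with per-level accuracy, as described
# under Methodology and Metrics. Field names and the normalization
# rules are assumptions, not the official GAIA scorer.
from collections import defaultdict

def normalize(ans: str) -> str:
    """Case-fold, strip, and coerce numbers so "42" and "42.0" match."""
    s = str(ans).strip().lower()
    try:
        return str(float(s))
    except ValueError:
        return s

def score(predictions: dict[str, str], references: list[dict]) -> dict[str, float]:
    """Return accuracy-level-1..3 given {task_id: answer} predictions.

    Each reference is assumed to carry "task_id", "Level", and
    "Final answer" fields, matching the hub dataset schema.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ref in references:
        level = ref["Level"]
        total[level] += 1
        pred = predictions.get(ref["task_id"], "")
        if normalize(pred) == normalize(ref["Final answer"]):
            correct[level] += 1
    return {f"accuracy-level-{lvl}": correct[lvl] / total[lvl]
            for lvl in sorted(total)}
```

Called as, say, `score({"some-task-id": "42"}, list(gaia["validation"]))`, this returns the three accuracy-level-N values listed under Metrics.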
Index Score: 62.2
- Adoption: 70
- Quality: 86
- Freshness: 84
- Citations: 68
- Engagement: 0