Benchmark · LLMs · v1.0

DROP

by Allen AI · free · Last verified 2026-03-01

DROP (Discrete Reasoning Over Paragraphs) is a reading-comprehension benchmark that evaluates a model's numerical reasoning over text. Systems read a paragraph and answer questions that require discrete operations such as addition, counting, sorting, or comparison. Unlike simpler QA datasets, DROP often requires several reasoning steps per question, so models cannot rely on basic information retrieval alone.
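To make the kinds of discrete operations concrete, here is a minimal sketch. The passage and questions are invented for this illustration and are not taken from the DROP dataset, and the helper name `field_goal_yards` is hypothetical. The regular expression stands in for the reading step that a real model would have to perform itself.

```python
import re

# Hypothetical passage written for this illustration; NOT from the DROP dataset.
PASSAGE = (
    "The Bears kicked a 22-yard field goal in the first quarter, "
    "a 35-yard field goal in the second quarter, "
    "and a 48-yard field goal in the fourth quarter."
)


def field_goal_yards(passage: str) -> list:
    """Extract the length in yards of every field goal mentioned in the passage."""
    return [int(m) for m in re.findall(r"(\d+)-yard field goal", passage)]


yards = field_goal_yards(PASSAGE)

# "How many field goals were kicked?"  (counting)
count_answer = len(yards)

# "How long was the longest field goal?"  (sorting)
longest_answer = sorted(yards)[-1]

# "How many yards longer was the longest field goal than the shortest?"
# (comparison followed by subtraction, i.e. two reasoning steps)
difference_answer = max(yards) - min(yards)

print(count_answer, longest_answer, difference_answer)
```

None of the three answers appears verbatim as a single span in the passage in the form the question asks for, which is the property that separates this style of question from span-extraction QA.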

https://allenai.org/data/drop

Index Score: C (Below Average)

Adoption: B+ · Quality: A · Freshness: B+ · Citations: F · Engagement: F

Specifications

License: CC BY-SA 4.0
Pricing: free
Capabilities: multi-step reasoning evaluation, numerical reasoning assessment, arithmetic operation testing (addition, subtraction), counting and sorting validation, comparative reasoning analysis, information extraction from complex passages, negation handling in questions, coreference resolution testing
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
Metrics: f1-score, exact-match
Methodology: Reading comprehension with questions requiring discrete reasoning operations like counting, sorting, and arithmetic over passage content.
Last Run: 2026-01-25
Tags: benchmark, dataset, evaluation, reading-comprehension, reasoning, numerical, question-answering, natural-language-processing, arithmetic-reasoning, multi-step-reasoning
Added: 2026-03-17
Completeness: 80%
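The two metrics listed above, exact-match and f1-score, can be sketched as follows. This is a simplified approximation for illustration, not the official DROP evaluation script, which also handles numbers and multi-span answers specially. The function names `normalize`, `exact_match`, and `f1_score` are assumptions made for this sketch.

```python
import string
from collections import Counter

ARTICLES = {"a", "an", "the"}


def normalize(text: str) -> list:
    """Lowercase, trim punctuation from token edges, and drop English articles."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    return [t for t in tokens if t and t not in ARTICLES]


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Bears", "bears"))
print(f1_score("Chicago Bears", "Bears"))
```

Exact-match gives credit only for a fully correct answer, while F1 gives partial credit when the prediction shares some tokens with the gold answer, so the two numbers are usually reported together.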
