InfiniteBench

by Zhang et al. / Tsinghua University · open-source · Last verified 2026-03-17

InfiniteBench evaluates LLM performance on tasks that require processing more than 100,000 tokens, well beyond the context windows of most models. Its math, novel-comprehension, code-debugging, and retrieval tasks are designed so that answering requires understanding information distributed across an extremely long context.

https://github.com/OpenBMB/InfiniteBench
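
As a quick orientation, below is a minimal sketch of inspecting one task file from the repository. The file name passkey.jsonl and the field name context are assumptions about the data layout, not details confirmed from the repo.

```python
# Minimal sketch of inspecting one InfiniteBench task file. The file name
# "passkey.jsonl" and the field name "context" are assumptions about the
# data layout, not confirmed from the repository.
import json

with open("passkey.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

# Ultra-long contexts of ~100K+ tokens correspond to several hundred
# thousand characters of raw text.
lengths = [len(ex["context"]) for ex in examples]
print(f"{len(examples)} examples, "
      f"mean context length {sum(lengths) / len(lengths):,.0f} chars")
```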
Overall Grade: C+ (Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
evaluation, long-context-evaluation, ultra-long-context
Integrations
Use Cases
model-evaluation, long-context-ai
API Available
No
Evaluated Models
gpt-4o, claude-opus-4, gemini-2-5-pro
Metrics
accuracy, score
Methodology
Average context length is roughly 200K tokens. Tasks span En.QA, En.MC, En.Dia (dialogue), Zh.QA, Code.Debug, Math.Calc, and retrieval. Models process the full context; accuracy is computed per task and then macro-averaged (see the sketch after this specification list).
Last Run
2026-02-12
Tags
long-context, 200k-tokens, retrieval, math, novels
Added
2026-03-17
Completeness
100%
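
To make the scoring in the Methodology entry concrete, the sketch below macro-averages per-task scores. Task names follow InfiniteBench's naming; the values are illustrative placeholders, not reported results.

```python
# Macro-averaged scoring as described in the methodology: accuracy is
# computed per task, then the unweighted mean is taken across tasks.
# The values here are illustrative placeholders, not reported results.
per_task_accuracy = {
    "En.QA": 0.30,
    "En.MC": 0.55,
    "Zh.QA": 0.25,
    "Code.Debug": 0.20,
    "Math.Calc": 0.10,
}

macro_avg = sum(per_task_accuracy.values()) / len(per_task_accuracy)
print(f"Macro-averaged score: {macro_avg:.3f}")  # 0.280
```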

Index Score
59.6
Adoption
62
Quality
89
Freshness
79
Citations
68
Engagement
0
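
The index score is consistent with an unweighted mean of the five subscores: (62 + 89 + 79 + 68 + 0) / 5 = 59.6. Equal weighting is an inference from the numbers shown here, not something the listing documents.

```python
# The 59.6 index score matches the unweighted mean of the five subscores.
# Equal weighting is inferred from the numbers above, not documented.
subscores = {"adoption": 62, "quality": 89, "freshness": 79,
             "citations": 68, "engagement": 0}
index_score = sum(subscores.values()) / len(subscores)
print(index_score)  # 59.6
```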
