
InfiniteBench

by Zhang et al. / Tsinghua University · free · Last verified 2026-03-17

InfiniteBench is a benchmark designed to evaluate the long-context capabilities of large language models. It features tasks that require processing and reasoning over inputs exceeding 100,000 tokens, including math, code debugging, and retrieval from novels, where crucial information is distributed across the entire context.
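
A minimal sketch of loading one task file, assuming the JSONL-per-task layout used by the benchmark's data release; the file name (kv_retrieval.jsonl) and field names are illustrative assumptions, so check the repository for the exact layout:

```python
import json

def load_task(path):
    """Yield one example per line from an InfiniteBench-style JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# "data/kv_retrieval.jsonl" is an assumed path; each example typically
# pairs a very long context with a question and a reference answer.
for example in load_task("data/kv_retrieval.jsonl"):
    for key, value in example.items():
        preview = value[:80] + "..." if isinstance(value, str) and len(value) > 80 else value
        print(key, "->", preview)
    break  # inspect just the first example
```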

https://github.com/OpenBMB/InfiniteBench
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B · Engagement: F

Specifications

License
Apache-2.0
Pricing
free
Capabilities
Long-context evaluation, Ultra-long context testing (100K+ tokens), Key-value retrieval over long text (a probe of this style is sketched in code below the specifications), Long-context code debugging, Long-context mathematical reasoning, Narrative understanding across long documents, Performance measurement for long-context models, Cross-document information extraction
Integrations
Use Cases
API Available
No
Evaluated Models
GPT-4o, Claude Opus 4, Gemini 2.5 Pro
Metrics
accuracy, score
Methodology
Average context length of 200K tokens. Tasks span En.QA, En.MC, En.Dia (dialogue), Zh.QA, Code.Debug, Math.Calc, and the Retrieve tasks (passkey, number, key-value). Models process the full context; accuracy is computed per task and then macro-averaged (see the scoring sketch below the specifications).
Last Run
2026-02-12
Tags
long-context, llm-evaluation, benchmark, ultra-long-context, retrieval, math, code-debugging, nlp, context-window, ai-research
Added
2026-03-17
Completeness
80%
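
To make the key-value retrieval capability concrete, here is a hypothetical probe generator in the spirit of the benchmark's Retrieve.KV task, not its actual implementation: it buries many UUID pairs in one long JSON context and asks for the value of a single key.

```python
import json
import random
import uuid

def make_kv_probe(n_pairs: int = 500, seed: int = 0):
    """Build a long 'haystack' of key-value pairs plus one retrieval query."""
    rng = random.Random(seed)
    pairs = {
        str(uuid.UUID(int=rng.getrandbits(128))): str(uuid.UUID(int=rng.getrandbits(128)))
        for _ in range(n_pairs)
    }
    target_key = rng.choice(list(pairs))
    context = json.dumps(pairs)  # the long context the model must scan
    prompt = (
        f"{context}\n\n"
        f"What is the value associated with the key {target_key}? "
        f"Answer with the value only."
    )
    return prompt, pairs[target_key]

prompt, expected = make_kv_probe()
print(len(prompt), "characters; expected answer:", expected)
```

Scaling n_pairs up stretches the context toward the 100K+ token regime the benchmark targets, and the exact reference answer makes scoring a simple string match.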
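
The scoring described under Methodology reduces to per-task accuracy followed by an unweighted macro average, so every task counts equally regardless of how many examples it has. A sketch with illustrative (made-up) per-task numbers:

```python
# Illustrative per-task accuracies (not real results).
per_task_accuracy = {
    "En.QA": 0.41, "En.MC": 0.67, "En.Dia": 0.22, "Zh.QA": 0.38,
    "Code.Debug": 0.30, "Math.Calc": 0.05, "Retrieve.KV": 0.89,
}

# Macro average: mean over tasks, each task weighted equally.
macro = sum(per_task_accuracy.values()) / len(per_task_accuracy)
print(f"macro-averaged accuracy: {macro:.3f}")
```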

Index Score

59.6
Adoption
62
Quality
89
Freshness
79
Citations
68
Engagement
0
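
The headline index matches the unweighted mean of the five subscores above; the averaging rule is inferred from the figures shown rather than documented:

```python
subscores = {"Adoption": 62, "Quality": 89, "Freshness": 79,
             "Citations": 68, "Engagement": 0}
print(sum(subscores.values()) / len(subscores))  # 59.6, the Index Score shown
```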
