InfiniteBench
by Zhang et al. / Tsinghua University · open-source · Last verified 2026-03-17
InfiniteBench evaluates LLM performance on tasks requiring processing of over 100,000 tokens, pushing well beyond the context windows of most models. It covers math, novels, code debugging, and retrieval tasks designed to require understanding of information distributed across an extremely long context.
https://github.com/OpenBMB/InfiniteBench
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B · Engagement: F
Specifications
- License
- Apache-2.0
- Pricing
- open-source
- Capabilities
- evaluation, long-context-evaluation, ultra-long-context
- Integrations
- Use Cases
- model-evaluation, long-context-ai
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, gemini-2-5-pro
- Metrics
- accuracy, score
- Methodology
- Average context length of around 200K tokens. Tasks span En.QA, En.MC, En.Dia (dialogue), Zh.QA, Code.Debug, Math.Calc, and retrieval. Models process the full context; accuracy is computed per task and then macro-averaged (see the sketch after the specifications list).
- Last Run
- 2026-02-12
- Tags
- long-context, 200k-tokens, retrieval, math, novels
- Added
- 2026-03-17
- Completeness
- 100%
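A minimal sketch of the macro-averaging step described in the methodology: accuracy is computed within each task, then the task accuracies are averaged with equal weight. The task names, data shape, and helper function here are illustrative assumptions, not the repository's actual evaluation code.

```python
# Macro-averaging sketch: per-task accuracy first, then an unweighted mean
# across tasks. Task names and example predictions are illustrative only.
from collections import defaultdict

def macro_average(results):
    """results: list of (task_name, is_correct) pairs, one entry per example."""
    per_task = defaultdict(list)
    for task, correct in results:
        per_task[task].append(1.0 if correct else 0.0)
    # Accuracy within each task...
    task_acc = {t: sum(v) / len(v) for t, v in per_task.items()}
    # ...then macro-average across tasks, so each task counts equally
    # regardless of how many examples it contains.
    return task_acc, sum(task_acc.values()) / len(task_acc)

if __name__ == "__main__":
    example = [("En.QA", True), ("En.QA", False), ("Math.Calc", True)]
    per_task, overall = macro_average(example)
    print(per_task, overall)  # {'En.QA': 0.5, 'Math.Calc': 1.0} 0.75
```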
Index Score
- Overall: 59.6
- Adoption: 62
- Quality: 89
- Freshness: 79
- Citations: 68
- Engagement: 0
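The overall figure appears to be the unweighted mean of the five component scores, since (62 + 89 + 79 + 68 + 0) / 5 = 59.6. A one-line check, assuming that weighting (the equal-weight rule is an inference from the numbers, not documented here):

```python
# Assumed scoring rule: overall index = unweighted mean of the component scores.
components = {"adoption": 62, "quality": 89, "freshness": 79, "citations": 68, "engagement": 0}
index_score = sum(components.values()) / len(components)
print(round(index_score, 1))  # 59.6
```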