InfiniteBench
by Zhang et al. / Tsinghua University · free · Last verified 2026-03-17
InfiniteBench is a benchmark designed to evaluate the long-context capabilities of large language models. It features tasks that require processing and reasoning over inputs exceeding 100,000 tokens, including math, code debugging, and retrieval from novels, where crucial information is distributed across the entire context.
https://github.com/OpenBMB/InfiniteBench
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B · Engagement: F
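The benchmark ships its tasks as JSONL files in the GitHub repo. Below is a minimal loading sketch; the field names (`context`, `input`, `answer`) and the file name are assumptions based on the repo's typical JSONL layout, so check the OpenBMB/InfiniteBench README for the exact schema.

```python
# Minimal sketch for loading one InfiniteBench task file and building a prompt.
# Field names and the file name below are ASSUMPTIONS; verify against the repo.
import json

def load_task(path):
    """Yield one example per line from a downloaded JSONL task file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def build_prompt(example):
    """Concatenate the long context with the task question.

    Real prompts are task-specific; this is a generic placeholder.
    """
    return f"{example['context']}\n\nQuestion: {example['input']}\nAnswer:"

if __name__ == "__main__":
    # "longbook_qa_eng.jsonl" is an assumed file name for the En.QA split.
    for ex in load_task("longbook_qa_eng.jsonl"):
        prompt = build_prompt(ex)
        print(len(prompt), "characters in prompt; reference answer:", ex["answer"])
        break
```

Note that a single example's `context` can exceed 100K tokens, so the prompt string here may be tens of megabytes; production harnesses typically stream or truncate per the model's context window.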
Specifications
- License
- Apache-2.0
- Pricing
- free
- Capabilities
- Long-context evaluation, Ultra-long context testing (100k+ tokens), Key-value retrieval over long text, Long-context code debugging, Long-context mathematical reasoning, Narrative understanding across long documents, Performance measurement for long-context models, Cross-document information extraction
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, gemini-2-5-pro
- Metrics
- accuracy, score
- Methodology
- Average context length of 200K tokens. Tasks span En.QA, En.MC, En.Dia (dialogue), Zh.QA, Code.Debug, Math.Calc, and retrieval. Models process the full context; accuracy is computed per task and then macro-averaged (see the scoring sketch after this list).
- Last Run
- 2026-02-12
- Tags
- long-context, llm-evaluation, benchmark, ultra-long-context, retrieval, math, code-debugging, nlp, context-window, ai-research
- Added
- 2026-03-17
- Completeness
- 80%
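The Methodology field above describes per-task accuracy followed by a macro average. A minimal sketch of that aggregation follows; the task names are real InfiniteBench splits, but the predictions and scores are toy values, not reported results.

```python
# Sketch of macro-averaged accuracy: score each task independently,
# then average with equal weight per task regardless of task size.
from statistics import mean

def task_accuracy(predictions, references):
    """Exact-match accuracy for one task (real scoring is task-specific)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy predictions/references for illustration only.
per_task = {
    "En.QA":      task_accuracy(["a", "b"], ["a", "c"]),   # 0.5
    "En.MC":      task_accuracy(["A"], ["A"]),             # 1.0
    "Code.Debug": task_accuracy(["x", "x"], ["x", "y"]),   # 0.5
}

# Macro average: a small task counts as much as a large one.
macro = mean(per_task.values())
print(f"macro-averaged accuracy: {macro:.3f}")  # 0.667
```

Macro averaging is the usual choice for multi-task benchmarks because it keeps a large task (e.g. retrieval with many instances) from dominating the headline number.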
Index Score
Overall: 59.6 · Adoption: 62 · Quality: 89 · Freshness: 79 · Citations: 68 · Engagement: 0