InfiniteBench
by Zhang et al. / Tsinghua University · open-source · Last verified 2026-03-17
InfiniteBench evaluates LLM performance on tasks requiring processing of over 100,000 tokens, pushing well beyond the context windows of most models. It covers math, novels, code debugging, and retrieval tasks designed to require understanding of information distributed across an extremely long context.
https://github.com/OpenBMB/InfiniteBench
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B · Engagement: F
Specifications
- License
- Apache-2.0
- Pricing
- open-source
- Capabilities
- evaluation, long-context-evaluation, ultra-long-context
- Integrations
- Use Cases
- model-evaluation, long-context-ai
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, gemini-2-5-pro
- Metrics
- accuracy, score
- Methodology
- Average context length of around 200K tokens. Tasks span En.QA, En.MC, En.Dia (dialogue), Zh.QA, Code.Debug, Math.Calc, and retrieval. Models process the full context; accuracy is computed per task and then macro-averaged (see the sketch after the specifications list).
- Last Run
- 2026-02-12
- Tags
- long-context, 200k-tokens, retrieval, math, novels
- Added
- 2026-03-17
- Completeness
- 100%
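A minimal sketch of the macro-averaging step described in the methodology: accuracy is computed within each task, then the task accuracies are averaged with equal weight. The task names, data shape, and helper function here are illustrative assumptions, not the repository's actual evaluation code.

```python
# Macro-averaging sketch: per-task accuracy first, then an unweighted mean
# across tasks. Task names and example predictions are illustrative only.
from collections import defaultdict

def macro_average(results):
    """results: list of (task_name, is_correct) pairs, one entry per example."""
    per_task = defaultdict(list)
    for task, correct in results:
        per_task[task].append(1.0 if correct else 0.0)
    # Accuracy within each task...
    task_acc = {t: sum(v) / len(v) for t, v in per_task.items()}
    # ...then macro-average across tasks, so each task counts equally
    # regardless of how many examples it contains.
    return task_acc, sum(task_acc.values()) / len(task_acc)

if __name__ == "__main__":
    example = [("En.QA", True), ("En.QA", False), ("Math.Calc", True)]
    per_task, overall = macro_average(example)
    print(per_task, overall)  # {'En.QA': 0.5, 'Math.Calc': 1.0} 0.75
```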
Index Score
- Overall: 59.6
- Adoption: 62
- Quality: 89
- Freshness: 79
- Citations: 68
- Engagement: 0
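The overall figure appears to be the unweighted mean of the five component scores, since (62 + 89 + 79 + 68 + 0) / 5 = 59.6. A one-line check, assuming that weighting (the equal-weight rule is an inference from the numbers, not documented here):

```python
# Assumed scoring rule: overall index = unweighted mean of the component scores.
components = {"adoption": 62, "quality": 89, "freshness": 79, "citations": 68, "engagement": 0}
index_score = sum(components.values()) / len(components)
print(round(index_score, 1))  # 59.6
```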