LongBench
by Bai et al. / Tsinghua University · free · Last verified 2026-03-17
LongBench is a comprehensive bilingual benchmark designed to evaluate the long-context understanding capabilities of large language models in English and Chinese. It comprises 21 diverse tasks, including single- and multi-document QA, summarization, and code completion, with an average context length of over 6,700 tokens to rigorously test model performance on extended inputs.
https://github.com/THUDM/LongBench
B (Above Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- Long-context understanding evaluation, Bilingual (English/Chinese) model assessment, Multi-task performance benchmarking, Single-document question answering, Multi-document question answering, Abstractive summarization evaluation, Few-shot learning assessment, Code completion proficiency testing
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, gemini-2-5-pro, chatglm2-6b
- Metrics
- f1-score, rouge-l, accuracy
- Methodology
- 4,750 examples across 21 tasks sourced from existing datasets and new annotations. Automatic metrics (F1, ROUGE-L, accuracy) used per task type; results macro-averaged across tasks.
- Last Run
- 2026-01-30
- Tags
- long-context, bilingual, multi-task, qa, summarization, llm-evaluation, benchmark, natural-language-processing, code-completion, few-shot-learning, chinese-nlp
- Added
- 2026-03-17
- Completeness
- 85%
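The methodology above (automatic per-task metrics such as token-level F1 for QA, macro-averaged across tasks) can be sketched as follows. This is an illustrative reconstruction, not LongBench's actual evaluation code; the function names and sample scores are hypothetical.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer,
    as commonly used for QA evaluation (illustrative sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Overlap counted with multiplicity via Counter intersection.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def macro_average(task_scores: dict) -> float:
    """Macro-average: unweighted mean of per-task scores."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task scores; summarization/code values stand in
# for ROUGE-L and code-completion metrics respectively.
scores = {
    "single_doc_qa": token_f1("the eiffel tower", "eiffel tower"),  # 0.8
    "summarization": 0.42,
    "code_completion": 0.65,
}
print(round(macro_average(scores), 3))  # → 0.623
```

Macro-averaging weights every task equally regardless of its example count, which is why the per-task scores are computed first and only then averaged.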
Index Score
64.5
Adoption
69
Quality
88
Freshness
74
Citations
77
Engagement
0