
LongBench

by Bai et al. / Tsinghua University · free · Last verified 2026-03-17

LongBench is a bilingual benchmark for evaluating the long-context understanding of large language models in English and Chinese. It comprises 21 tasks, spanning single- and multi-document QA, summarization, few-shot learning, and code completion, with an average context length of over 6,700 tokens to rigorously test model performance on extended inputs.

https://github.com/THUDM/LongBench
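
The dataset is distributed through the Hugging Face Hub. A minimal sketch of loading one task subset, assuming the THUDM/LongBench dataset card and the field names ("input", "context", "answers") documented in the project README; those names are assumptions and may differ if the card has changed:

```python
from datasets import load_dataset

# "narrativeqa" is one of the 21 task configs; swap in any other task name.
data = load_dataset("THUDM/LongBench", "narrativeqa", split="test")

sample = data[0]
print(sample["input"])         # the task query
print(sample["answers"])       # gold references used for scoring
print(len(sample["context"]))  # the long input document
```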
Overall Grade: B (Above Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
MIT
Pricing
free
Capabilities
Long-context understanding evaluation, Bilingual (English/Chinese) model assessment, Multi-task performance benchmarking, Single-document question answering, Multi-document question answering, Abstractive summarization evaluation, Few-shot learning assessment, Code completion proficiency testing
Integrations
Use Cases
API Available
No
Evaluated Models
gpt-4o, claude-opus-4, gemini-2-5-pro, chatglm2-6b
Metrics
f1-score, rouge-l, accuracy
Methodology
4,750 examples across 21 tasks, sourced from existing datasets and new annotations. Automatic metrics (F1, ROUGE-L, accuracy) are applied per task type, and per-task results are macro-averaged into the headline score; see the sketch after these specifications.
Last Run
2026-01-30
Tags
long-context, bilingual, multi-task, qa, summarization, llm-evaluation, benchmark, natural-language-processing, code-completion, few-shot-learning, chinese-nlp
Added
2026-03-17
Completeness
85%
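
As a concrete illustration of the scoring above: QA tasks use a token-overlap F1, and the headline number is an unweighted mean over tasks. A minimal sketch of both, assuming a SQuAD-style F1 (LongBench's own metric additionally normalizes punctuation and segments Chinese text, which is elided here); the helper names are hypothetical:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a prediction and one reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def macro_average(per_task_scores: dict[str, float]) -> float:
    """Macro-average: every task counts equally, regardless of example count."""
    return sum(per_task_scores.values()) / len(per_task_scores)
```

Macro-averaging means a small task (few examples) moves the headline score as much as a large one, which keeps any single data source from dominating the benchmark.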

Index Score

64.5
Adoption
69
Quality
88
Freshness
74
Citations
77
Engagement
0
