BenchmarkLLMs v1.0

LongBench

by Bai et al. / Tsinghua University · open-source · Last verified 2026-03-17

LongBench is the first bilingual (English/Chinese), multi-task benchmark for evaluating long-context understanding in LLMs. It covers 21 tasks spanning single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion, with English contexts averaging 6,711 words.

https://github.com/THUDM/LongBench
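
For orientation, the data is also distributed through the Hugging Face Hub under the same THUDM namespace. A minimal loading sketch; the config and field names follow the repository's README at the time of writing and may change:

```python
from datasets import load_dataset

# Each of the 21 tasks is a separate config; "narrativeqa" (single-doc QA)
# is used here as an example. trust_remote_code may be required because the
# dataset ships a loading script.
data = load_dataset("THUDM/LongBench", "narrativeqa",
                    split="test", trust_remote_code=True)

for example in data.select(range(3)):
    print(example["input"][:80])    # question / instruction
    print(len(example["context"]))  # long source document
    print(example["answers"])       # list of reference answers
```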
Grade: B (Above Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: evaluation, long-context-evaluation, bilingual-evaluation
Integrations: —
Use Cases: model-evaluation, long-context-ai, multilingual-nlp
API Available: No
Evaluated Models: gpt-4o, claude-opus-4, gemini-2-5-pro, chatglm2-6b
Metrics: f1-score, rouge-l, accuracy (a token-level F1 sketch follows this list)
Methodology: 4,750 examples across 21 tasks, sourced from existing datasets and new annotations. Automatic metrics (F1, ROUGE-L, accuracy) are applied per task type, and results are macro-averaged across tasks (see the second sketch below).
Last Run: 2026-01-30
Tags: long-context, bilingual, multi-task, qa, summarization
Added: 2026-03-17
Completeness: 100%
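
The f1-score used for the QA tasks is a token-overlap F1 in the SQuAD style, taken as the maximum over all reference answers. A simplified sketch; LongBench's own implementation also normalizes punctuation, articles, and Chinese segmentation, which is omitted here:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a prediction and one reference answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def qa_f1(prediction: str, answers: list[str]) -> float:
    """Score against multiple references by keeping the best match."""
    return max(token_f1(prediction, a) for a in answers)

print(qa_f1("the eiffel tower", ["Eiffel Tower", "Paris"]))  # 0.8
```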
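
Macro-averaging means each task contributes equally to the headline number, regardless of how many examples it holds. A minimal sketch with hypothetical per-task scores for illustration:

```python
from statistics import mean

# Hypothetical per-task scores (0-100), each already computed with that
# task's native metric (F1, ROUGE-L, or accuracy).
task_scores = {
    "narrativeqa": 23.6,   # single-doc QA, F1
    "hotpotqa": 51.2,      # multi-doc QA, F1
    "gov_report": 32.8,    # summarization, ROUGE-L
    "trec": 68.0,          # few-shot classification, accuracy
    # ... remaining tasks omitted for brevity
}

# Macro-average: an unweighted mean over tasks, not over examples.
overall = mean(task_scores.values())
print(f"LongBench macro-average: {overall:.1f}")
```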

Index Score: 64.5

Adoption: 69
Quality: 88
Freshness: 74
Citations: 77
Engagement: 0
