LongBench
by Bai et al. / Tsinghua University · free · Last verified 2026-03-17
LongBench is a comprehensive bilingual benchmark designed to evaluate the long-context understanding capabilities of large language models in English and Chinese. It comprises 21 diverse tasks, including single- and multi-document QA, summarization, and code completion, with an average context length of over 6,700 tokens to rigorously test model performance on extended inputs.
https://github.com/THUDM/LongBench
B (Above Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- Long-context understanding evaluation, Bilingual (English/Chinese) model assessment, Multi-task performance benchmarking, Single-document question answering, Multi-document question answering, Abstractive summarization evaluation, Few-shot learning assessment, Code completion proficiency testing
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, gemini-2-5-pro, chatglm2-6b
- Metrics
- f1-score, rouge-l, accuracy
- Methodology
- 4,750 examples across 21 tasks sourced from existing datasets and new annotations. Automatic metrics (F1, ROUGE-L, accuracy) used per task type; results macro-averaged across tasks.
- Last Run
- 2026-01-30
- Tags
- long-context, bilingual, multi-task, qa, summarization, llm-evaluation, benchmark, natural-language-processing, code-completion, few-shot-learning, chinese-nlp
- Added
- 2026-03-17
- Completeness
- 85%
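The methodology above (automatic per-task metrics such as token-level F1 for QA, macro-averaged across tasks) can be sketched as follows. This is an illustrative reconstruction, not LongBench's actual evaluation code; the function names and sample scores are hypothetical.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer,
    as commonly used for QA evaluation (illustrative sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Overlap counted with multiplicity via Counter intersection.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def macro_average(task_scores: dict) -> float:
    """Macro-average: unweighted mean of per-task scores."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task scores; summarization/code values stand in
# for ROUGE-L and code-completion metrics respectively.
scores = {
    "single_doc_qa": token_f1("the eiffel tower", "eiffel tower"),  # 0.8
    "summarization": 0.42,
    "code_completion": 0.65,
}
print(round(macro_average(scores), 3))  # → 0.623
```

Macro-averaging weights every task equally regardless of its example count, which is why the per-task scores are computed first and only then averaged.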
Index Score
64.5
Adoption
69
Quality
88
Freshness
74
Citations
77
Engagement
0