Benchmark · LLMs · v1.0

RULER

by Hsieh et al. / NVIDIA · open-source · Last verified 2026-03-17

RULER is a synthetic long-context benchmark whose tasks can be generated at context lengths from 4K to 128K tokens. It covers needle-in-a-haystack retrieval (single, multi-key, and multi-value), variable tracking (a multi-hop tracing task that stands in for coreference resolution), aggregation, and question answering, giving a more detailed picture of long-context ability than a single-needle retrieval test.

https://github.com/hsiehjackson/RULER
Grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B+ · Engagement: F

Specifications

License: Apache-2.0
Pricing: open-source
Capabilities: evaluation, long-context-evaluation, retrieval-testing
Integrations: (none listed)
Use Cases: model-evaluation, long-context-ai
API Available: No
Evaluated Models: gpt-4o, claude-opus-4, gemini-2-5-pro, llama-3-70b
Metrics: accuracy
Methodology: Synthetic tasks generated at configurable context lengths (4K–128K). Four task categories: NIAH (single/multi-key/multi-value), variable tracking, aggregation, and QA. Averaged accuracy across categories at each context length.
Last Run: 2026-02-28
Tags: long-context, retrieval, needle-in-haystack, synthetic, scalable
Added: 2026-03-17
Completeness: 100%
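
The Methodology entry above can be illustrated with a small, self-contained sketch. This is not code from the RULER repository: the function names (`make_niah_sample`, `is_correct`, `average_by_length`), the filler sentences, the needle template, and the substring-match scoring are all assumptions made for illustration, and length is counted in whitespace-separated words rather than model tokens.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical filler text; real generators use their own haystack sources.
FILLER = ["The grass is green.", "The sky is blue.", "The sun is yellow."]


def make_niah_sample(target_words: int, rng: random.Random) -> dict:
    """Build one toy single-needle sample of at most `target_words` words."""
    key = f"key-{rng.randrange(10_000)}"
    value = str(rng.randrange(1_000_000, 10_000_000))
    needle = f"The special magic number for {key} is {value}."

    # Add filler sentences until the word budget is used up.
    sentences = []
    words = len(needle.split())
    i = 0
    while words + len(FILLER[i % len(FILLER)].split()) <= target_words:
        sentence = FILLER[i % len(FILLER)]
        sentences.append(sentence)
        words += len(sentence.split())
        i += 1

    # Hide the needle at a random sentence boundary.
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    return {
        "context": " ".join(sentences),
        "question": f"What is the special magic number for {key}?",
        "answer": value,
    }


def is_correct(prediction: str, answer: str) -> bool:
    """Assumed scoring rule: the reference answer appears in the prediction."""
    return answer in prediction


def average_by_length(results) -> dict:
    """Average accuracy across categories at each context length.

    `results` is an iterable of (category, context_length, correct) tuples.
    Accuracy is computed per (length, category) cell first, then the
    per-category accuracies are averaged for each length.
    """
    per_cell = defaultdict(list)
    for category, length, correct in results:
        per_cell[(length, category)].append(1.0 if correct else 0.0)

    per_length = defaultdict(list)
    for (length, _category), outcomes in per_cell.items():
        per_length[length].append(mean(outcomes))

    return {length: mean(accs) for length, accs in sorted(per_length.items())}
```

Averaging per category before averaging per length keeps a category with many samples from dominating the score, which matches the "averaged accuracy across categories" wording; whether the listing's own run weights categories this way is not stated.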

Index Score

Overall: 65.2
Adoption: 71
Quality: 90
Freshness: 82
Citations: 75
Engagement: 0
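
The page does not say how the overall Index Score is derived from the five sub-scores. As a quick sanity check, the sketch below computes their plain unweighted mean; it does not reproduce the listed 65.2, which suggests some weighting is applied. The weights are not given here and no attempt is made to guess them.

```python
# Sub-scores exactly as listed under Index Score.
sub_scores = {
    "Adoption": 71,
    "Quality": 90,
    "Freshness": 82,
    "Citations": 75,
    "Engagement": 0,
}

listed_index = 65.2

# Plain arithmetic mean of the five sub-scores (an assumption to test,
# not the site's documented formula).
unweighted = sum(sub_scores.values()) / len(sub_scores)

print(f"unweighted mean: {unweighted:.1f}")
print(f"listed index:    {listed_index:.1f}")
print(f"difference:      {listed_index - unweighted:.1f}")
```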
