
IFEval

by Google Research · open-source · Last verified 2026-03-01

Instruction-Following Evaluation benchmark testing models' ability to precisely follow verifiable formatting instructions. Includes constraints like word count limits, specific formatting requirements, keyword inclusion/exclusion, and structural rules that can be programmatically verified.

https://github.com/google-research/google-research/tree/master/instruction_following_eval
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A · Citations: B+ · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
model-evaluation, instruction-following-testing, constraint-verification
Integrations
lm-eval-harness
Use Cases
instruction-compliance-testing, formatting-evaluation, constraint-following-assessment
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
prompt-level-accuracy, instruction-level-accuracy
Methodology
541 prompts containing verifiable instructions such as word-count limits, formatting requirements, and keyword constraints. Responses are checked programmatically for exact compliance with each instruction.
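The two reported metrics can be sketched as follows. This is an illustrative example, not code from the official IFEval repository: the checker functions, constraint set, and sample responses are all hypothetical, but the scoring logic matches the stated definitions (instruction-level accuracy counts each constraint separately; prompt-level accuracy requires every constraint on a prompt to pass).

```python
import re

# Hypothetical checkers in the style of IFEval's verifiable constraints.
def check_word_count(response: str, max_words: int) -> bool:
    """Pass if the response stays within a word limit."""
    return len(response.split()) <= max_words

def check_keyword_included(response: str, keyword: str) -> bool:
    """Pass if a required keyword appears (case-insensitive)."""
    return keyword.lower() in response.lower()

def check_all_caps(response: str) -> bool:
    """Pass if the response contains no lowercase letters."""
    return not re.search(r"[a-z]", response)

def score(prompts):
    """Compute (instruction-level, prompt-level) accuracy.

    `prompts` is a list of (response, [check callables]) pairs.
    Instruction-level: fraction of individual checks passed.
    Prompt-level: fraction of prompts where *all* checks passed.
    """
    total_checks = passed_checks = passed_prompts = 0
    for response, checks in prompts:
        results = [chk(response) for chk in checks]
        total_checks += len(results)
        passed_checks += sum(results)
        passed_prompts += all(results)
    return passed_checks / total_checks, passed_prompts / len(prompts)

# Two illustrative prompt/response pairs with their constraints.
prompts = [
    ("THE ANSWER IS PARIS",
     [check_all_caps, lambda r: check_keyword_included(r, "paris")]),
    ("A SHORT REPLY",  # 3 words, violates the 2-word limit below
     [lambda r: check_word_count(r, 2), check_all_caps]),
]
instr_acc, prompt_acc = score(prompts)
print(instr_acc, prompt_acc)  # 0.75 0.5
```

Because every constraint is checked by code rather than by a judge model, scores are exactly reproducible; the trade-off is that only mechanically verifiable instructions can be included.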
Last Run
2026-02-25
Tags
benchmark, evaluation, instruction-following, constraints, formatting
Added
2026-03-17
Completeness
100%

Index Score: 64.3
Adoption: 74
Quality: 86
Freshness: 84
Citations: 70
Engagement: 0
