Dataset · legal · v1.0

Legal-BERT Training Data

by Gerasimos Spanakis / Maastricht University · open-source · Last verified 2026-03-17

The Legal-BERT training corpus is a large collection of English legal text assembled from UK legislation, EU legislation, ECHR/ECLI court decisions, and US contracts, curated specifically for pretraining domain-adapted BERT models. It underpins a family of Legal-BERT models that are reported to outperform general-domain language models on legal NLP tasks.

https://huggingface.co/nlpaueb/legal-bert-base-uncased
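
As a rough illustration of how a checkpoint pretrained on this corpus might be queried, the sketch below uses the HuggingFace Transformers `fill-mask` pipeline with the model ID from the URL above. It assumes `transformers` and a backend such as `torch` are installed and that the checkpoint can be downloaded. The helper names `mask_word` and `top_predictions` are hypothetical and do not come from the listing.

```python
# Sketch only: assumes `pip install transformers torch` and network access
# to download the checkpoint on first use.
MODEL_ID = "nlpaueb/legal-bert-base-uncased"  # model ID taken from the URL in this listing


def mask_word(sentence: str, word: str, mask_token: str = "[MASK]") -> str:
    """Replace the first occurrence of `word` with the BERT mask token (hypothetical helper)."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(masked_sentence: str, k: int = 5):
    """Return the k most likely fillers for the masked position as (token, score) pairs."""
    # Imported lazily so mask_word() works even without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return [(r["token_str"], r["score"]) for r in fill_mask(masked_sentence, top_k=k)]
```

Calling `top_predictions(mask_word("The applicant submitted a complaint to the court.", "court"))` would download the checkpoint and return candidate tokens with their scores. No outputs are shown here because they depend on the downloaded weights.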

Index Score: C (Below Average)

Adoption: B+ · Quality: A · Freshness: B · Citations: F · Engagement: F

Specifications

License: CC-BY-4.0
Pricing: open-source
Capabilities: legal-text-pretraining, contract-analysis, legal-classification, ner-legal
Integrations: HuggingFace Transformers
Use Cases: language-model-pretraining, legal-nlp-research, contract-ai
API Available: No
Tags: legal-nlp, pretraining, contracts, court-decisions, legislation, BERT
Added: 2026-03-17
Completeness: 100%
