Dataset · LLMs · v1.0

CC-News

by Common Crawl Foundation · open-source · Last verified 2026-03-17

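A news article dataset derived from Common Crawl's news crawl (CC-News). The Hugging Face version contains approximately 708,000 English news articles published between January 2017 and December 2019, a small portion of the full English CC-News data. A much larger CC-News collection, about 63 million English articles crawled between September 2016 and February 2019, was a key component of the RoBERTa pretraining mix. The dataset provides a domain-specific corpus for training models on journalistic writing style and current-events language.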

https://huggingface.co/datasets/cc_news
Grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: D · Citations: B+ · Engagement: F

Specifications

License: Custom
Pricing: open-source
Capabilities: language-modeling, news-understanding, temporal-reasoning
Integrations: hugging-face, tensorflow-datasets
Use Cases: llm-pretraining, news-classification, temporal-analysis
API Available: Yes
Tags: nlp, news, web-crawl, current-events, roberta
Added: 2026-03-17
Completeness: 100%
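
For the hugging-face integration and the temporal-analysis use case listed above, a minimal Python sketch might look like the following. It makes several assumptions: the dataset identifier `cc_news` is taken from the Hugging Face URL on this page (newer Hub versions may require a namespaced identifier), each record is assumed to carry a `date` string beginning with `YYYY-MM-DD`, and the helper names `load_cc_news` and `articles_in_year` are hypothetical, introduced only for illustration.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Iterator


def load_cc_news(split: str = "train"):
    """Load CC-News from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The identifier
    "cc_news" is an assumption taken from the URL on this page.
    """
    # Imported lazily so the filtering helper below works without the library.
    from datasets import load_dataset

    return load_dataset("cc_news", split=split)


def articles_in_year(
    records: Iterable[Dict[str, Any]],
    year: int,
    date_field: str = "date",  # assumed field name; check the dataset card
) -> Iterator[Dict[str, Any]]:
    """Yield records whose date field falls in the given calendar year.

    Records with a missing or unparseable date are skipped.
    """
    for record in records:
        raw = str(record.get(date_field) or "")
        try:
            # Only the leading YYYY-MM-DD part is parsed (assumed format).
            parsed = datetime.strptime(raw[:10], "%Y-%m-%d")
        except ValueError:
            continue
        if parsed.year == year:
            yield record
```

The filter accepts any iterable of dictionaries, so it can be tried on a handful of hand-written records before downloading the full dataset, for example `list(articles_in_year(load_cc_news(), 2018))` once the download has succeeded.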

Index Score

63.3
Adoption: 70
Quality: 79
Freshness: 35
Citations: 78
Engagement: 0
