CC-News
by CommonCrawl Foundation · open-source · Last verified 2026-03-17
A news article dataset derived from the News section of CommonCrawl, containing approximately 708,000 English news articles collected between September 2016 and March 2019. Used as a key component in the RoBERTa pretraining mix, it provides a domain-specific corpus for training models on journalistic writing style and current events language.
https://huggingface.co/datasets/cc_news
Overall grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: D · Citations: B+ · Engagement: F
Specifications
- License: Custom
- Pricing: open-source
- Capabilities: language-modeling, news-understanding, temporal-reasoning
- Integrations: hugging-face, tensorflow-datasets
- Use Cases: llm-pretraining, news-classification, temporal-analysis
- API Available: Yes
- Tags: nlp, news, web-crawl, current-events, roberta
- Added: 2026-03-17
- Completeness: 100%
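For the Hugging Face integration and the temporal-analysis use case listed above, a minimal sketch might look like the following. It assumes the `datasets` library is installed, that the `cc_news` identifier from the URL above still resolves on the Hub, and that each record carries a `date` string beginning with `YYYY-MM-DD`; the field name and the helper names (`parse_date`, `articles_in_range`, `load_cc_news`) are illustrative assumptions, not part of this listing.

```python
from datetime import date, datetime


def parse_date(value):
    """Parse the leading YYYY-MM-DD of a date string; return None if missing or malformed."""
    try:
        return datetime.strptime(value[:10], "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def articles_in_range(records, start, end):
    """Yield records whose (assumed) 'date' field falls within [start, end], inclusive."""
    for rec in records:
        d = parse_date(rec.get("date"))
        if d is not None and start <= d <= end:
            yield rec


def load_cc_news(streaming=True):
    """Open the dataset from the Hugging Face Hub (requires network access).

    The import is local so the helpers above work without `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset("cc_news", split="train", streaming=streaming)


if __name__ == "__main__":
    # Hypothetical usage: stream the corpus and print titles of a few 2018 articles.
    articles = articles_in_range(load_cc_news(), date(2018, 1, 1), date(2018, 12, 31))
    for i, article in enumerate(articles):
        print(article.get("title"))
        if i >= 4:
            break
```

Streaming avoids downloading the full corpus before filtering. Records with a missing or unparseable date are skipped rather than raising an error.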
Index Score: 63.3
- Adoption: 70
- Quality: 79
- Freshness: 35
- Citations: 78
- Engagement: 0