CC-News
by CommonCrawl Foundation · free · Last verified 2026-03-17
CC-News is a dataset of more than 700,000 English-language news articles drawn from the Common Crawl news archive and collected between 2016 and 2019. A larger version of the corpus was part of the pretraining data for the RoBERTa model, and the dataset remains a common source of journalistic text for pretraining and for adapting models to news language.
https://huggingface.co/datasets/cc_news
Overall grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: D · Citations: B+ · Engagement: F
Specifications
- License: Custom
- Pricing: free
- Capabilities: Large-scale language model pretraining, Domain adaptation for news text, News article summarization, Topic modeling and trend analysis, Named entity recognition (NER) on journalistic content, Event detection from text, Training text generation models for news writing, Sentiment analysis of news coverage
- API Available: Yes
- Tags: nlp, news, web-crawl, roberta, text-corpus, language-modeling, unsupervised-learning, english-language, journalism, common-crawl
- Added: 2026-03-17
- Completeness: 0.9%
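The listing marks programmatic access as available and links to the Hugging Face dataset page, so a short loading sketch may be useful. The example below assumes the Hugging Face `datasets` library, that the identifier `cc_news` from the URL above still resolves on the Hub, and that each record carries `title`, `text`, and `domain` fields. The helper names `iter_usable_articles` and `stream_cc_news` and the 200-character threshold are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator


def iter_usable_articles(records: Iterable[Dict], min_chars: int = 200) -> Iterator[Dict]:
    """Yield records that have a non-empty title and a body of at least min_chars characters."""
    for record in records:
        # Fields may be missing or None, so fall back to an empty string.
        title = (record.get("title") or "").strip()
        text = (record.get("text") or "").strip()
        if title and len(text) >= min_chars:
            yield {"title": title, "text": text, "domain": record.get("domain")}


def stream_cc_news(min_chars: int = 200) -> Iterator[Dict]:
    """Stream articles from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the filter above can be used without the library installed.
    from datasets import load_dataset

    # Assumption: "cc_news" (taken from the listing URL) still resolves on the Hub.
    # streaming=True iterates over records without downloading the full corpus first.
    dataset = load_dataset("cc_news", split="train", streaming=True)
    return iter_usable_articles(dataset, min_chars=min_chars)
```

For a quick look at the data, something like `for article in itertools.islice(stream_cc_news(), 3): print(article["title"])` should print the first few titles. Streaming is used here so the example does not need the whole corpus on disk.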
Index Score: 63.3
- Adoption: 70
- Quality: 79
- Freshness: 35
- Citations: 78
- Engagement: 0