Dataset · LLMs · v1.0

CC-News

by CommonCrawl Foundation · free · Last verified 2026-03-17

CC-News is a dataset of just over 700,000 English-language news articles drawn from CommonCrawl's news crawl and collected between 2016 and 2019. It is a subset of the much larger CC-News corpus (roughly 63 million articles) that formed part of RoBERTa's pretraining data. It offers a ready source of journalistic text for pretraining models or adapting them to news language and the events of that period.

https://huggingface.co/datasets/cc_news
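
A minimal sketch of how the corpus might be pulled into a pretraining pipeline, assuming the Hugging Face `datasets` library is installed and that the dataset id `cc_news` (taken from the URL above) resolves on the Hub. The helper names `iter_article_texts` and `stream_cc_news` and the 200-character cutoff are illustrative choices, not part of the dataset. The `title` and `text` field names follow the dataset card as best understood, so check them against the actual schema before relying on them.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_article_texts(records: Iterable[dict], min_chars: int = 200) -> Iterator[str]:
    """Yield 'title + body' strings, skipping articles whose body is shorter than min_chars."""
    for rec in records:
        text = (rec.get("text") or "").strip()
        if len(text) < min_chars:
            continue  # drop stubs and near-empty pages
        title = (rec.get("title") or "").strip()
        yield f"{title}\n\n{text}" if title else text


def stream_cc_news(limit: int = 5) -> List[str]:
    """Stream a few cleaned articles from the Hub (needs network access)."""
    # Imported lazily so the filtering helper above has no third-party dependency.
    from datasets import load_dataset

    # streaming=True avoids downloading the full corpus before iterating.
    ds = load_dataset("cc_news", split="train", streaming=True)
    return list(islice(iter_article_texts(ds), limit))
```

Calling `stream_cc_news(3)` would return up to three article strings. The filtering helper can also be run on any iterable of dicts, which makes it easy to test offline.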

C (Below Average)

Adoption: B+ · Quality: B+ · Freshness: D · Citations: F · Engagement: F

Specifications

License: Custom
Pricing: free
Capabilities: Large-scale language model pretraining, Domain adaptation for news text, News article summarization, Topic modeling and trend analysis, Named entity recognition (NER) on journalistic content, Event detection from text, Training text generation models for news writing, Sentiment analysis of news coverage
API Available: Yes
Tags: nlp, news, web-crawl, roberta, text-corpus, language-modeling, unsupervised-learning, english-language, journalism, common-crawl
Added: 2026-03-17
Completeness: 0.9%
