🎯 Action Pack · Beginner · Free

Common Crawl

Common Crawl provides petabytes of open web crawl data, updated with new crawls on a regular schedule. This Action Pack guides you through accessing and exploring this vast dataset, which is foundational for training large language models and for a wide range of research projects.

Tags: web, crawl, training-data, web-crawl, dataset, big-data, llm-training-data, aws-s3

4 Steps

  1. Identify a Recent Crawl: Browse the list of published crawls (e.g., https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/) to find a specific crawl identifier (e.g., `CC-MAIN-2023-50`) and its associated data files (WAT, WET, WARC).
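Each crawl publishes one `<format>.paths.gz` listing per data format under its `crawl-data/` prefix, enumerating every data file of that type. As a sketch, a small helper (the function name is my own) can build those listing URLs from a crawl identifier:

```python
BASE = "https://data.commoncrawl.org"

def listing_urls(crawl_id):
    """Build the URLs of the per-format path listings for one crawl.

    Each ``<format>.paths.gz`` file lists every data file of that
    format in the crawl, one S3 key per line.
    """
    return {fmt: f"{BASE}/crawl-data/{crawl_id}/{fmt}.paths.gz"
            for fmt in ("warc", "wat", "wet")}

urls = listing_urls("CC-MAIN-2023-50")
print(urls["warc"])
# https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz
```

Downloading and gunzipping `warc.paths.gz` gives you concrete WARC file paths to use in step 3.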

  2. Access Data on AWS S3: Common Crawl data is hosted on AWS S3 and can be accessed directly via `s3://commoncrawl/`. Ensure you have the AWS CLI installed (`pip install awscli`) and configured (`aws configure`).

  3. Download a Sample WARC File: Locate a small WARC file path from the Common Crawl Index for your chosen crawl (e.g., `crawl-data/CC-MAIN-2023-50/segments/1701980838612/warc/CC-MAIN-20231207134714-20231207164714-00000.warc.gz`). Use the AWS CLI to download it to your current directory, e.g. `aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1701980838612/warc/CC-MAIN-20231207134714-20231207164714-00000.warc.gz .`
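If you prefer not to set up AWS credentials, the same objects are also served over plain HTTPS at data.commoncrawl.org. A minimal standard-library sketch (the helper name is my own; the key is the example path from this step):

```python
from urllib.request import urlretrieve

BASE = "https://data.commoncrawl.org"

def https_url(key):
    """Map an S3 key in the commoncrawl bucket to its public HTTPS URL."""
    return f"{BASE}/{key}"

key = ("crawl-data/CC-MAIN-2023-50/segments/1701980838612/warc/"
       "CC-MAIN-20231207134714-20231207164714-00000.warc.gz")

# Uncomment to download (full WARC files are large, on the order of a
# gigabyte, so make sure you have the disk space and bandwidth):
# urlretrieve(https_url(key), key.rsplit("/", 1)[-1])
```

Either route (AWS CLI or HTTPS) leaves you with a local `.warc.gz` file for step 4.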

  4. Parse WARC Content: Install the `warcio` Python library (`pip install warcio`). Use it to open the downloaded `.warc.gz` file, iterate over its records, and extract the target URL of each HTTP response.
