Common Crawl Domain Names


Published July 13, 2025 | Version 1.0.0


Description

This dataset provides a lightweight extraction of approximately 200 million org-level (registrable) domains referenced in Common Crawl's main indexes. The aim is to provide an open, free, and mostly useful subset of global DNS entries: websites that were in use at the time of crawling.

To keep the data minimal and useful at smaller scales, it contains only three columns: the domain name, the time first crawled, and the time last crawled. Data is provided for each year as well as for the entire range, and filtered.tsv.gz contains a smaller subset that excludes domains seen only once.
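
As a rough illustration of how the data might be loaded, a minimal Python sketch is shown below. The column names, their order, and the assumption of a headerless file are inferred from the description above, not taken from a documented schema.

    import pandas as pd

    # Load the filtered subset (assumed: gzip-compressed, tab-separated,
    # headerless, with the three columns in the order described above).
    df = pd.read_csv(
        "filtered.tsv.gz",
        sep="\t",
        header=None,
        names=["domain", "first_crawled", "last_crawled"],
    )

    print(df.head())
    print(f"{len(df):,} domains loaded")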

Source code

The scripts to recreate this dataset can be found in the project's GitHub repository.

Caveats

This data has not yet been tested in production apps, and so may contain errors. Two known issues are:

  • URL-encoded names have not been deduplicated.
  • The extractor uses a naive rule to find the main domain (the first domain that is longer than 3 characters and would not create more than 3 subdomains). This may result in lost or duplicate data; see the sketch after this list.
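
For illustration only, the sketch below shows one possible reading of that heuristic. The exact interpretation (and the function name naive_org_domain) is an assumption; the authoritative logic lives in the project's repository.

    def naive_org_domain(hostname: str) -> str:
        # One possible reading of the naive rule described above (assumed):
        # scan labels from the TLD towards the host and cut at the first
        # label longer than 3 characters, provided the resulting suffix has
        # at most 3 labels; otherwise fall back to the last 3 labels.
        labels = hostname.lower().rstrip(".").split(".")
        for i in range(len(labels) - 1, -1, -1):
            if len(labels[i]) > 3 and len(labels) - i <= 3:
                return ".".join(labels[i:])
        return ".".join(labels[-3:])

    # Such a rule loses data on shared-hosting suffixes: distinct sites
    # under github.io all collapse to the same "org-level" domain.
    print(naive_org_domain("www.example.co.uk"))             # example.co.uk
    print(naive_org_domain("user.blog.somehost.github.io"))  # github.io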

Contact

To report issues, suggest changes, or collaborate, open an issue or pull request on the project's repository.

Files (40.8 GB)
