@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python, 442 stars, 90 forks

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python, 194 stars, 15 forks

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java, 122 stars, 14 forks

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java, 37 stars, 18 forks

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook, 25 stars, 5 forks

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook, 59 stars, 10 forks

Repositories

Showing 10 of 73 repositories
  • cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python, 442 stars, MIT license, 90 forks, 5 issues, 3 pull requests. Updated Oct 20, 2025
  • whirlwind-python Public

    A whirlwind tour of Common Crawl's data using Python

    Python, 27 stars, Apache-2.0 license, 6 forks, 0 issues, 1 pull request. Updated Oct 20, 2025
  • cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook, 25 stars, 5 forks, 0 issues, 0 pull requests. Updated Oct 19, 2025
  • robotstxt-experiments Public

    How is the Robots Exclusion Protocol (robots.txt) used on the web? This project tries to gain insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.

    Jupyter Notebook, 0 stars, MIT license, 0 forks, 0 issues, 0 pull requests. Updated Oct 19, 2025
  • cc-host-index Public

    Tools for working with the host index

    Python, 10 stars, 2 forks, 1 issue, 0 pull requests. Updated Oct 18, 2025
  • cc-index-annotations Public

    Example code to join an annotation to a host or URL index

    Python, 1 star, 0 forks, 0 issues, 0 pull requests. Updated Oct 16, 2025
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python, 194 stars, Apache-2.0 license, 15 forks, 1 issue, 2 pull requests. Updated Oct 14, 2025
  • cc-nutch-example Public

    Apache Nutch example project to archive content in WARC files

    Shell, 3 stars, Apache-2.0 license, 2 forks, 0 issues, 0 pull requests. Updated Oct 11, 2025
  • web-languages Public

    Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

    58 stars, 73 forks, 2 issues, 1 pull request. Updated Oct 9, 2025
  • cc-webgraph Public

    Tools to construct and process Common Crawl webgraphs

    Java, 99 stars, Apache-2.0 license, 4 forks, 2 issues (1 issue needs help), 0 pull requests. Updated Oct 5, 2025
