Common Crawl seeds (#3)

* Common Crawl Foundation seeds

* clean mil list to just hostnames

* doc: add location of ccf repo that generated these files

---------

Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Greg Lindahl 2024-09-16 14:33:58 +00:00 committed by GitHub
parent 4392d90188
commit ba124bec62
3 changed files with 51294 additions and 0 deletions


@@ -49,3 +49,10 @@ The End of Term Web Archive team compiled a list of sources on the Web from whic
* us-government-website-directory.csv - Pulled from [GSA's federal-website-directory repo](https://github.com/GSA/federal-website-directory) (via https://raw.githubusercontent.com/GSA/federal-website-directory/main/us-government-website-directory.csv on 9/12/2024). The repo README indicates "The Federal Website Directory is a comprehensive list of the public-facing websites of the U.S. Federal Government, spanning all three branches".
* dotmil_websites.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/blob/main/data/dataset/dotmil_websites.csv on 9/12/2024)
* 2_govt_urls_federal_only.csv - Pulled from [GSA's govt-urls repo](https://github.com/GSA/govt-urls/) (via https://raw.githubusercontent.com/GSA/govt-urls/main/2_govt_urls_federal_only.csv on 9/12/2024). The README indicates the repo "contains the list of public government managed domains that exist outside of the top-level .gov and .mil domains."
### Common Crawl Foundation seeds
See [commoncrawl/ccf-eot-seeds-2024](https://github.com/commoncrawl/ccf-eot-seeds-2024) for details.
* ccf-gov-federal-web-graph-2024-jun-jul-aug.txt -- all .gov federal hostnames from current-federal.csv domains in CCF's 2024 June/July/August web graph
* ccf-mil-web-graph-2024-jun-jul-aug.txt -- all .mil hostnames from CCF's 2024 June/July/August web graph
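
The two seed files above are described as hostname lists restricted to particular domains. As a rough illustration only (this is not the code from the ccf-eot-seeds-2024 repo), the Python sketch below reduces URL-or-host entries to bare hostnames and keeps those that fall under a given set of domains. The function names `to_hostname`, `under_domains`, and `select_hosts` are hypothetical, and the sketch assumes the domain list has already been loaded into a set of lowercase strings.

```python
import re
from urllib.parse import urlsplit

# Matches a leading URL scheme such as "https://" (illustrative helper).
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def to_hostname(entry: str) -> str:
    """Reduce a URL or bare host entry to a lowercase hostname ('' if none)."""
    entry = entry.strip()
    # urlsplit only populates .hostname when a scheme or "//" prefix is present.
    if not _SCHEME.match(entry) and not entry.startswith("//"):
        entry = "//" + entry
    return (urlsplit(entry).hostname or "").rstrip(".")


def under_domains(hostname: str, domains: set[str]) -> bool:
    """True if hostname equals, or is a subdomain of, any listed domain."""
    labels = hostname.split(".")
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


def select_hosts(entries, domains):
    """Return sorted, de-duplicated hostnames that fall under the given domains."""
    hosts = {to_hostname(e) for e in entries}
    return sorted(h for h in hosts if h and under_domains(h, domains))
```

Matching on whole dot-separated labels, rather than on a plain string suffix, keeps a host such as `notnasa.gov` from being selected when only `nasa.gov` is in the domain set.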

Diffs for the other two changed files are suppressed because they are too large.