mirror of https://github.com/end-of-term/eot2024 (synced 2024-11-25 07:43:42 +01:00)
Common Crawl seeds (#3)
* Common Crawl Foundation seeds
* clean mil list to just hostnames
* doc: add location of ccf repo that generated these files

---------

Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
parent 4392d90188
commit ba124bec62
@@ -49,3 +49,10 @@ The End of Term Web Archive team compiled a list of sources on the Web from whic
* us-government-website-directory.csv - Pulled from [GSA's federal-website-directory repo](https://github.com/GSA/federal-website-directory) (via https://raw.githubusercontent.com/GSA/federal-website-directory/main/us-government-website-directory.csv on 9/12/2024). The repo README indicates "The Federal Website Directory is a comprehensive list of the public-facing websites of the U.S. Federal Government, spanning all three branches".
* dotmil_websites.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/blob/main/data/dataset/dotmil_websites.csv on 9/12/2024)
* 2_govt_urls_federal_only.csv - Pulled from [GSA's govt-urls repo](https://github.com/GSA/govt-urls/) (via https://raw.githubusercontent.com/GSA/govt-urls/main/2_govt_urls_federal_only.csv on 9/12/2024). The README indicates the repo "contains the list of public government managed domains that exist outside of the top-level .gov and .mil domains."
### Common Crawl Foundation seeds

See [commoncrawl/ccf-eot-seeds-2024](https://github.com/commoncrawl/ccf-eot-seeds-2024) for details.

* ccf-gov-federal-web-graph-2024-jun-jul-aug.txt -- all .gov federal hostnames from current-federal.csv domains in CCF's 2024 June/July/August web graph
* ccf-mil-web-graph-2024-jun-jul-aug.txt -- all .mil hostnames from CCF's 2024 June/July/August web graph
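
A minimal sketch of how a hostname seed list like the two above might be consumed. It assumes the files hold one bare hostname per line, which the commit message ("clean mil list to just hostnames") suggests but this page does not confirm. The function names `load_hostnames` and `to_seed_urls`, the `https` default scheme, and the sample hostnames are hypothetical and are not part of the eot2024 or ccf-eot-seeds-2024 repos.

```python
from typing import Iterable, List


def load_hostnames(lines: Iterable[str], suffix: str) -> List[str]:
    """Return sorted, de-duplicated hostnames that fall under `suffix`.

    Assumption: each input line is a bare hostname (no scheme, no path).
    """
    bare = suffix.lstrip(".")
    seen = set()
    for raw in lines:
        # Normalize case and drop surrounding whitespace and any trailing dot.
        host = raw.strip().lower().rstrip(".")
        if not host:
            continue  # skip blank lines
        if host == bare or host.endswith("." + bare):
            seen.add(host)
    return sorted(seen)


def to_seed_urls(hosts: Iterable[str], scheme: str = "https") -> List[str]:
    """Turn hostnames into crawl seed URLs (the scheme is an assumption)."""
    return [f"{scheme}://{host}/" for host in hosts]


# Hypothetical sample lines, not taken from the real seed files.
sample = [
    "www.example.mil\n",
    "Example.MIL\n",
    "\n",
    "www.example.mil\n",
    "www.example.com\n",
]
hosts = load_hostnames(sample, ".mil")
seeds = to_seed_urls(hosts)
```

To run this against a real file, pass an open file handle in place of `sample`, for example `load_hostnames(open("seed-lists/ccf-mil-web-graph-2024-jun-jul-aug.txt"), ".mil")`.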
41038
seed-lists/ccf-gov-federal-web-graph-2024-jun-jul-aug.txt
Normal file
File diff suppressed because it is too large
10249
seed-lists/ccf-mil-web-graph-2024-jun-jul-aug.txt
Normal file
File diff suppressed because it is too large