mirror of https://github.com/end-of-term/eot2024, synced 2025-01-18 05:03:44 +01:00
Add CDC .html seed list
parent 47e8f8eb67
commit a6e38c7311
46045
seed-lists/CDC html URLs from sitemap data - 20241201.csv
Normal file
File diff suppressed because it is too large
@@ -82,3 +82,4 @@ The End of Term Web Archive team compiled a list of sources on the Web from whic
* dotmil_websites.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/blob/main/data/dataset/dotmil_websites.csv on 9/12/2024)
* 2_govt_urls_federal_only.csv - Pulled from [GSA's govt-urls repo](https://github.com/GSA/govt-urls/) (via https://raw.githubusercontent.com/GSA/govt-urls/main/2_govt_urls_federal_only.csv on 9/12/2024). The README indicates the repo "contains the list of public government managed domains that exist outside of the top-level .gov and .mil domains."
* usagov.csv - Seeds scraped from https://www.usa.gov/agency-index/ by Jake Abrams, Founder, CivicsUS, LLC.
* CDC html URLs from sitemap data - 20241201.csv - A file of about 46,000 .html URLs created by parsing the CDC's sitemap index at https://www.cdc.gov/wcms-auto-sitemap-index.xml, which points to other sitemaps, which in turn list the .html files.
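
The sitemap walk described for the CDC list can be sketched roughly as follows. This is a minimal illustration, not the script the End of Term team actually used (the commit does not include it). It assumes the sitemaps follow the standard sitemaps.org XML format and are served uncompressed; the names `iter_html_urls` and `fetch_bytes` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set
from urllib.request import urlopen

# XML namespace defined by the sitemaps.org protocol (index and URL-set files).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch_bytes(url: str) -> bytes:
    """Download a URL and return the raw body (assumes uncompressed XML)."""
    with urlopen(url) as response:
        return response.read()


def iter_html_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield every <loc> ending in .html reachable from a sitemap or sitemap index."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text and loc.text.strip()
    ]

    if root.tag == SITEMAP_NS + "sitemapindex":
        # An index lists other sitemaps; follow each one.
        for child_url in locs:
            yield from iter_html_urls(child_url, fetch, seen)
    elif root.tag == SITEMAP_NS + "urlset":
        # A URL set lists pages; keep only those whose URL ends in .html.
        for page_url in locs:
            if page_url.lower().endswith(".html"):
                yield page_url
```

Calling `iter_html_urls("https://www.cdc.gov/wcms-auto-sitemap-index.xml")` would walk the index over the network; passing a different `fetch` callable makes it possible to run the same logic against saved copies of the sitemaps. The suffix check is deliberately simple and would skip URLs that carry a query string after `.html`.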