mirror of
https://github.com/end-of-term/eot2024
synced 2025-01-18 13:13:43 +01:00
cdc datasets urls from sitemap (#26)
CDC dataset URLs derived from this sitemap file: https://s3.amazonaws.com/sa-socrata-sitemaps-us-east-1-fedramp-prod/sitemaps/sitemap-datasets-data.cdc.gov0.xml. Each row of this CSV file has two comma-separated fields: the sitemap file the URL is sourced from, then the URL. Note that the pages these URLs point to usually include a download button to get a CSV of the dataset, but the CSVs themselves aren't included in this seed file and won't be retrieved in the crawl. Even so, the existence of each dataset and the links to its metadata will be documented in the archive.
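
The commit message does not say how the seed file was generated. As a rough illustration of the two-column layout it describes, here is a minimal sketch that assumes the sitemap follows the standard sitemaps.org schema (a `urlset` of `url/loc` elements) and that the CSV has no header row. The function names, the sample XML, and the `example.org` and `abcd-1234` style URLs are hypothetical and are not taken from the repository.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to apply here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical stand-in for a downloaded sitemap; the URLs are made up.
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://data.cdc.gov/d/abcd-1234</loc></url>
  <url><loc>https://data.cdc.gov/d/efgh-5678</loc></url>
</urlset>"""


def sitemap_to_rows(sitemap_url, sitemap_xml):
    """Return (sitemap_url, page_url) pairs for every <loc> in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        (sitemap_url, loc.text.strip())
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def rows_to_csv(rows):
    """Serialize rows as CSV text: source sitemap first, then the URL."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    rows = sitemap_to_rows("https://example.org/sitemap.xml", SAMPLE_XML)
    print(rows_to_csv(rows), end="")
```

To run this against the real sitemap, the XML would first have to be fetched (for example with `urllib.request.urlopen`) and passed in as `sitemap_xml`; that step is left out so the sketch works offline.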
parent ee6cba5868
commit 5a83e824d4
1418
seed-lists/cdc-dataset-urls.csv
Normal file
File diff suppressed because it is too large