CDC dataset URLs derived from this sitemap file: https://s3.amazonaws.com/sa-socrata-sitemaps-us-east-1-fedramp-prod/sitemaps/sitemap-datasets-data.cdc.gov0.xml. The format of this CSV file is: sitemap file the URL is sourced from,the URL.
Note that the pages these URLs point to usually include a download button for a CSV of the dataset, but the CSVs themselves aren't included in this seed file and won't be retrieved in the crawl. The archive will at least document the existence of each dataset and the links to its metadata.
* FDA HTML urls for seed list
This is a file of URLs derived from the FDA's sitemap.xml file. URLs of the form /media/*/download and the warning-letters URLs have been filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately.
The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
* FDA warning letters from sitemap.xml
This is a file of URLs derived from the FDA's sitemap.xml file that point to warning-letters content. I thought this was going to be PDF content, but it turns out to be HTML content.
The format of this CSV file is: sitemap file the URL is sourced from,the URL
* FDA download links from sitemaps
This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's pdfjs library, so the "download" in the path name is a little misleading.
The format of this CSV file is: sitemap file the URL is sourced from,the URL
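The split of the FDA sitemap URLs into the three files above could be sketched roughly as follows. This is a minimal illustration, not the script that produced the files: the function name `classify`, the regular expression, and the assumption that warning-letter URLs contain the path segment `warning-letters` are all hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumed shape of the download links: /media/<id>/download
DOWNLOAD_PATH = re.compile(r"^/media/[^/]+/download/?$")


def classify(url):
    """Return which seed file a sitemap URL would belong to (hypothetical helper)."""
    path = urlsplit(url).path
    if DOWNLOAD_PATH.match(path):
        return "download"
    # Assumption: warning-letter pages carry "warning-letters" in their path.
    if "warning-letters" in path:
        return "warning-letters"
    return "html"
```

Anything that is neither a download link nor a warning letter falls through to the HTML list, which matches the "filtered out" description of that file.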
* NIH urls from three sitemaps.
This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml.
The format of this CSV file is: sitemap file the URL is sourced from,the URL
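For reference, the two-column CSV layout used by these sitemap-derived files can be produced with a few lines of standard-library Python. This is only a sketch under assumptions: the helper names `sitemap_to_rows` and `rows_to_csv` are hypothetical, the sitemap is assumed to be a plain `urlset` using the standard sitemaps.org namespace, and fetching the XML is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_rows(sitemap_url, sitemap_xml):
    """Return (sitemap URL, page URL) pairs for every <loc> in a urlset."""
    root = ET.fromstring(sitemap_xml)
    return [
        (sitemap_url, loc.text.strip())
        for loc in root.iterfind(".//sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def rows_to_csv(rows):
    """Serialize rows as: sitemap file the URL is sourced from,the URL."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Using the `csv` module rather than joining strings by hand keeps URLs that contain commas correctly quoted.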
* Update README.md
2 new lists from EDGI and Hermann-Wu submitted via dot-info@archive.org
* Add files via upload
2 new lists from EDGI and Hermann-Wu submitted via dot-info@archive.org
* Update README.md
bulk list re USDA seeds submitted by AWI 20241222
* Add files via upload
bulk list re USDA seeds submitted by AWI 20241222
* Update README.md
3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx, AWI-USDA-FSIS-20241222.xlsx, NSF-20241224.xlsx
* Add files via upload
3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx, AWI-USDA-FSIS-20241222.xlsx, NSF-20241224.xlsx
* Update README.md
list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx
* Add files via upload
list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx
* Update README.md
bulk list on performance.gov by Ailsa Hermann-Wu
* Add files via upload
Bulk list submitted by Ailsa Hermann-Wu re performance.gov 20241219
* Update README.md
added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx
* Add files via upload
USDA_FIS_ERS.xlsx from Gary Price/infodocket. 1700 or so URLs from the USDA, specifically the Food Inspection Service and Economic Research Service.
* Update README.md
3 more bulk lists from Gary Price sent on 12/14/2024
* Add files via upload
3 new bulk lists from Gary Price submitted 12/14/2024
* Update README.md
added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx
* Add files via upload
USDA_FIS_ERS.xlsx from Gary Price/infodocket. 1700 or so URLs from the USDA, specifically the Food Inspection Service and Economic Research Service.
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Forgot to add sitemaps.txt
Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>
---------
Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>
This is a CSV file of PDF links obtained from webpages found on the US CDC website. It contains 46,873 links, with the format: the source HTML file containing the PDF link; the time in UTC at which the accessibility of the PDF file was confirmed; and a URL pointing to the PDF file itself.
This file replaces the two previous files. This file has had the PDF links deduped, so if multiple pages point to the same PDF, you'll only see an entry for the first reference. PDF links that point to non-gov domains have been omitted as well. If the PDF link contains a fragment, the fragment will be removed from the path (e.g. "/a/path/mypdf.pdf#page=3" will get turned into "/a/path/mypdf.pdf"). All the PDF files have had their accessibility and content type verified with an HTTP HEAD request on Dec. 9, 2024.
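The cleanup steps described above (fragment removal, dropping non-gov hosts, keeping only the first reference to each PDF) might look something like the sketch below. It is illustrative only: the function name `dedupe_pdf_links` and the (source page, PDF URL) input shape are assumptions, and the HTTP HEAD verification step is deliberately left out.

```python
from urllib.parse import urldefrag, urlsplit


def dedupe_pdf_links(rows):
    """Clean (source_page, pdf_url) pairs, given in crawl order (hypothetical helper)."""
    seen = set()
    kept = []
    for source_page, pdf_url in rows:
        # Drop any fragment: ".../mypdf.pdf#page=3" becomes ".../mypdf.pdf".
        url, _fragment = urldefrag(pdf_url)
        host = (urlsplit(url).hostname or "").lower()
        # Assumption: "non-gov" means the hostname does not end in .gov.
        if not host.endswith(".gov"):
            continue
        # Keep only the first page that references a given PDF.
        if url in seen:
            continue
        seen.add(url)
        kept.append((source_page, url))
    return kept
```

Stripping the fragment before the duplicate check matters, since otherwise "mypdf.pdf#page=3" and "mypdf.pdf" would be counted as two different files.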
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
added bulk file from EnergyFundsForAll.org
* Bulk list from EnergyFundsForAll
* Remove extra whitespace
Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
* Remove duplicate listing of infodocket-11-21-2024.xls
---------
Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
Co-authored-by: James R. Jacobs <freegovinfo@gmail.com>
* adding info docket bulk seed list
* Update README.md
* Update README.md
* Add files via upload
Bulk lists from Gary Price and Kelly Smith. Seed list readme updated with file names.
* Common Crawl Foundation seeds
* clean mil list to just hostnames
* doc: add location of ccf repo that generated these files
---------
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>