* Update README.md
added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx
* Add files via upload
USDA_FIS_ERS.xlsx from Gary Price/Infodocket. About 1,700 URLs from the USDA, specifically the Food Inspection Service and Economic Research Service.
* Update README.md
3 more bulk lists from Gary Price sent on 12/14/2024
* Add files via upload
3 new bulk lists from Gary Price submitted 12/14/2024
* Update README.md
added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx
* Add files via upload
USDA_FIS_ERS.xlsx from Gary Price/Infodocket. About 1,700 URLs from the USDA, specifically the Food Inspection Service and Economic Research Service.
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Batch commit of sitemap URL seeds under 500MB or 250 files
* Forgot to add sitemaps.txt
Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>
---------
Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>
This is a CSV file of PDF links obtained from webpages found on the US CDC website. It contains 46,873 links. Each row has three fields: the source HTML page containing the PDF link; the UTC time at which the accessibility of the PDF file was confirmed; and a URL pointing to the PDF file itself.
This file replaces the two previous files. The PDF links have been deduplicated, so if multiple pages point to the same PDF, you'll only see an entry for the first reference. PDF links that point to non-gov domains have been omitted as well. If a PDF link contains a fragment, the fragment is removed from the URL (e.g. "/a/path/mypdf.pdf#page=3" becomes "/a/path/mypdf.pdf"). All the PDF files had their accessibility and content type verified with an HTTP HEAD request on Dec. 9, 2024.
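The cleanup described above (fragment removal, non-gov filtering, first-reference dedup) could be sketched roughly as follows. This is not the script that produced the file: the function name `clean_pdf_links`, the `(source_page, pdf_url)` input shape, and the ".gov suffix" test for what counts as a gov domain are all assumptions made for illustration. The timestamp field and the HTTP HEAD verification step are left out.

```python
from urllib.parse import urldefrag, urlsplit


def clean_pdf_links(rows):
    """Return (source_page, pdf_url) pairs after fragment removal,
    non-gov filtering, and dedup.

    `rows` is a hypothetical iterable of (source_page, pdf_url) pairs;
    the real pipeline's input format is not documented here.
    """
    seen = set()
    kept = []
    for source_page, pdf_url in rows:
        # Drop any "#fragment" so "...mypdf.pdf#page=3" and "...mypdf.pdf" match.
        url, _fragment = urldefrag(pdf_url)

        # Assumption: "gov domain" means the hostname ends in ".gov".
        host = (urlsplit(url).hostname or "").lower()
        if not (host == "gov" or host.endswith(".gov")):
            continue

        # Keep only the first page that references a given PDF.
        if url in seen:
            continue
        seen.add(url)
        kept.append((source_page, url))
    return kept
```

Deduplicating after fragment removal matters here: two links that differ only in `#page=...` point at the same file and should collapse to one entry.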
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
added bulk file from EnergyFundsForAll.org
* Bulk list from EnergyFundsForAll
* Remove extra whitespace
Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
* Remove duplicate listing of infodocket-11-21-2024.xls
---------
Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
Co-authored-by: James R. Jacobs <freegovinfo@gmail.com>
* adding Infodocket bulk seed list
* Update README.md
* Update README.md
* Add files via upload
Bulk lists from Gary Price and Kelly Smith. Seed list readme updated with file names.
* Common Crawl Foundation seeds
* clean mil list to just hostnames
* doc: add location of ccf repo that generated these files
---------
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>