mirror of
https://github.com/end-of-term/eot2024
synced 2025-01-18 13:13:43 +01:00
PDFs from the CDC website - single file (#17)
This is a CSV file of PDF links obtained from webpages found on the US CDC website. It contains 46,873 links, each with three fields: the source HTML file containing the PDF link; the UTC time at which the accessibility of the PDF file was confirmed; and a URL pointing to the PDF file itself. This file replaces the two previous files. The PDF links have been deduplicated, so if multiple pages point to the same PDF, only the first reference has an entry. PDF links that point to non-gov domains have been omitted as well. If a PDF link contains a fragment, the fragment is removed from the path (e.g. "/a/path/mypdf.pdf#page=3" becomes "/a/path/mypdf.pdf"). All the PDF files had their accessibility and content type verified with an HTTP HEAD request on Dec. 09 2024.
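The cleaning steps described above could be sketched roughly as follows. This is an illustration only, not the script used to produce the file: the function names (`is_gov_host`, `clean_pdf_links`, `verify_pdf`) are hypothetical, "non-gov" is assumed to mean a hostname that does not end in `.gov`, and relative links are assumed to be resolved against the page they were found on.

```python
from datetime import datetime, timezone
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import Request, urlopen


def is_gov_host(url):
    """Assumed rule: keep only URLs whose hostname ends in '.gov'."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".gov")


def clean_pdf_links(pairs):
    """Strip fragments, drop non-gov links, keep the first reference to each PDF.

    `pairs` is an iterable of (source_page_url, pdf_href) tuples.
    Returns a list of (source_page_url, pdf_url) tuples in input order.
    """
    seen = set()
    rows = []
    for source_page, href in pairs:
        absolute = urljoin(source_page, href)       # resolve relative links
        pdf_url, _fragment = urldefrag(absolute)    # drop "#page=3" and similar
        if not is_gov_host(pdf_url):
            continue
        if pdf_url in seen:                         # a later page links the same PDF
            continue
        seen.add(pdf_url)
        rows.append((source_page, pdf_url))
    return rows


def verify_pdf(pdf_url, timeout=30):
    """Send an HTTP HEAD request and return the UTC check time on success.

    Returns None if the request fails, the status is not 200, or the
    reported content type is not application/pdf.
    """
    request = Request(pdf_url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            ok = (response.status == 200
                  and response.headers.get_content_type() == "application/pdf")
    except OSError:  # covers URLError, HTTPError, and timeouts
        return None
    return datetime.now(timezone.utc) if ok else None
```

With these assumptions, a link such as "/a/path/mypdf.pdf#page=3" found on one page and the same PDF linked without a fragment from a second page would collapse to a single row attributed to the first page.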
parent 5a9195431e
commit f4b194553a
46874
seed-lists/CDC found PDFs 20241209 cleaned single file.csv
Normal file
File diff suppressed because it is too large