mirror of https://github.com/end-of-term/eot2024, synced 2025-01-18 21:13:45 +01:00
Update README for FDA and NIH sitemap seed lists
parent a3d96841db
commit ee6cba5868
@@ -111,3 +111,7 @@ The End of Term Web Archive team and other contributors compiled a list of sourc
* CDC html URLs from sitemap data - 20241201.csv - file of about 46,000 .html URLs created by parsing the CDC's sitemap file at https://www.cdc.gov/wcms-auto-sitemap-index.xml, which then pointed to other sitemaps, which pointed to .html files.
* CDC found PDFs 20241209 cleaned single file.csv - .gov PDF links obtained from webpages found on the US CDC website. It contains 46,873 links, each in the format: the source HTML file containing the PDF link; the UTC time at which the accessibility of the PDF file was confirmed; and a URL pointing to the PDF file itself. PDF links are deduplicated when multiple pages point to the same PDF, and link fragments are removed. Every PDF file had its accessibility and content type verified with an HTTP HEAD request on Dec. 9, 2024.
* sitemaps.txt - List of federal website sitemap URLs discovered programmatically by Bentley Hensel via robots.txt files and common sitemap URL paths. URLs scraped from these sitemaps are organized into files named by hostname in the sitemap-url-seeds directory. This directory contains 2,749 files listing more than 56 million URLs, as well as `_report_6_FINAL.txt`, which gives statistics about the file size and URL count of each file. URLs from the sitemap-url-seeds files will NOT be loaded into the eth2024_bulk Nomination Tool instance due to size.
* fda-download-urls.csv - URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's pdfjs library, so the "download" in the path name is a little misleading. The format of this CSV file is: sitemap file the URL is sourced from,the URL.
* fda-no-downloads-no-warning-letters.csv - URLs derived from the FDA's sitemap.xml file, with URLs of the form /media/\*/download and the warning-letter URLs filtered out, so in theory everything in this file is HTML. The format of this CSV file is: sitemap file the URL is sourced from,the URL.
* fda-warnings-letters.csv - URLs derived from the FDA's sitemap.xml file that point to warning letter content in HTML. The format of this CSV file is: sitemap file the URL is sourced from,the URL.
* nih-urls.csv - URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml. The format of this CSV file is: sitemap file the URL is sourced from,the URL.
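
The split of FDA sitemap URLs into the three files described above could be sketched roughly as follows. This is a minimal illustration, not the script that produced the files: the `classify` and `split_rows` helpers, the regular expression, and the assumption that warning-letter pages contain `warning-letters` in their path are all hypothetical, and the sample URLs are made up for the example.

```python
import csv
import io
import re
from urllib.parse import urlsplit

# Assumed pattern for links of the form /media/<id>/download.
DOWNLOAD_RE = re.compile(r"^/media/[^/]+/download/?$")
# Assumption: warning-letter pages carry this marker somewhere in the path.
WARNING_MARKER = "warning-letters"


def classify(url):
    """Return 'download', 'warning-letter', or 'other' for a URL."""
    path = urlsplit(url).path
    if DOWNLOAD_RE.match(path):
        return "download"
    if WARNING_MARKER in path:
        return "warning-letter"
    return "other"


def split_rows(csv_text):
    """Group two-column rows (source sitemap, URL) by URL class."""
    buckets = {"download": [], "warning-letter": [], "other": []}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, url = row[0], row[1]
        buckets[classify(url)].append((source, url))
    return buckets


# Hypothetical sample rows in the "sitemap file,URL" format.
SAMPLE = (
    "https://www.fda.gov/sitemap.xml,https://www.fda.gov/media/12345/download\n"
    "https://www.fda.gov/sitemap.xml,https://www.fda.gov/example/warning-letters/example-co\n"
    "https://www.fda.gov/sitemap.xml,https://www.fda.gov/food\n"
)

if __name__ == "__main__":
    for name, rows in split_rows(SAMPLE).items():
        print(name, len(rows))
```

Matching on the URL path rather than the full string keeps query strings and fragments from affecting the classification.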