Commit Graph

1 Commits

Author SHA1 Message Date
YakShaver
a3d96841db
Adds FDA and NIH HTML URLs for seed list (#25)
* FDA HTML URLs for seed list

This is a file of URLs derived from the FDA's sitemap.xml file. URLs of the form /media/*/download and the warning-letters URLs have been filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately.

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
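
As a rough illustration of the two-column format described above, here is a minimal sketch of reading such a file in Python. The function name `read_seed_rows` and the sample sitemap and page URLs are hypothetical and only for illustration. The sketch also assumes the files have no header row and that no URL contains a comma.

```python
import csv
import io


def read_seed_rows(text):
    """Parse 'sitemap,url' lines into (sitemap, url) tuples."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 2:
            continue  # skip blank or malformed lines
        sitemap, url = (field.strip() for field in record)
        rows.append((sitemap, url))
    return rows


# Hypothetical sample data in the documented format.
sample = (
    "https://www.fda.gov/sitemap.xml,https://www.fda.gov/example-page\n"
    "\n"
    "https://www.fda.gov/sitemap.xml,https://www.fda.gov/media/12345/download\n"
)
rows = read_seed_rows(sample)
for sitemap, url in rows:
    print(sitemap, "->", url)
```

For a real file, the same function could be fed the result of `open(path, newline="").read()`.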

* FDA warning letters from sitemap.xml

This is a file of warning-letter URLs derived from the FDA's sitemap.xml file. I thought this was going to be PDF content, but it turns out to be HTML content.

The format of this CSV file is: sitemap file the URL is sourced from,the URL

* FDA download links from sitemaps

This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's pdfjs library, so the "download" in the path name is a little misleading.

The format of this CSV file is: sitemap file the URL is sourced from,the URL
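
The split into HTML pages, warning letters, and /media/id/download links could be reproduced along these lines. This is only a sketch, not the script actually used to build the files: the `classify` helper, the regular expression, and the substring test for "warning-letters" are assumptions, and the example URLs are made up.

```python
import re
from urllib.parse import urlparse

# Assumed pattern for paths of the form /media/<id>/download.
DOWNLOAD_PATH = re.compile(r"^/media/[^/]+/download/?$")


def classify(url):
    """Return 'download', 'warning-letters', or 'html' for a sitemap URL."""
    path = urlparse(url).path
    if DOWNLOAD_PATH.match(path):
        return "download"
    # Assumption: warning-letter pages carry 'warning-letters' in the path.
    if "warning-letters" in path:
        return "warning-letters"
    return "html"


# Hypothetical URLs, one per category.
examples = [
    "https://www.fda.gov/media/12345/download",
    "https://www.fda.gov/some-section/warning-letters/example-company",
    "https://www.fda.gov/about-fda",
]
for url in examples:
    print(classify(url), url)
```

Matching on the parsed path rather than the full URL keeps query strings and hostnames from affecting the result.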

* NIH URLs from three sitemaps

This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml. 

The format of this CSV file is: sitemap file the URL is sourced from,the URL
2025-01-02 11:33:40 -06:00