* FDA HTML URLs for seed list
This is a file of URLs derived from the FDA's sitemap.xml file. URLs of the form /media/*/download and the warning-letter URLs have been filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately.
The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
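The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the file: the two regular expressions are assumptions inferred from the phrases "/media/*/download" and "warning-letters", and the function names are hypothetical.

```python
import re

# Assumed patterns; the notes only mention "/media/*/download" and "warning-letters".
MEDIA_DOWNLOAD = re.compile(r"/media/[^/]+/download/?$")
WARNING_LETTER = re.compile(r"/warning-letters(/|$)")


def classify(url: str) -> str:
    """Return 'download', 'warning-letter', or 'html' for a single URL."""
    path = url.split("?", 1)[0]  # ignore any query string
    if MEDIA_DOWNLOAD.search(path):
        return "download"
    if WARNING_LETTER.search(path):
        return "warning-letter"
    return "html"


def html_rows(rows):
    """Keep only (sitemap, url) rows that are neither downloads nor warning letters."""
    return [(sitemap, url) for sitemap, url in rows if classify(url) == "html"]
```

The same classify function could be used to produce the other two FDA files by keeping the rows labelled "warning-letter" or "download" instead.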
* FDA warning letters from sitemap.xml
This is a file of warning-letter URLs derived from the FDA's sitemap.xml file. I thought this was going to be PDF content, but it turns out to be HTML content.
The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
* FDA download links from sitemaps
This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's PDF.js library, so the "download" in the path name is a little misleading.
The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
* NIH URLs from three sitemaps
This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml.
The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
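For illustration, here is one way rows in this two-column format could be produced from a sitemap. It is a sketch under assumptions, not the tool actually used: it parses sitemap XML that has already been fetched, handles only plain urlset files (not sitemap index files), and the sample page URLs and function names are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_rows(sitemap_url: str, xml_text: str):
    """Return (sitemap_url, page_url) pairs for every <loc> inside a <url> entry."""
    root = ET.fromstring(xml_text)
    return [
        (sitemap_url, loc.text.strip())
        for loc in root.iterfind(".//sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def to_csv(rows) -> str:
    """Serialize rows as two-column CSV text: sitemap URL, then page URL."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical sample; the page URLs are placeholders, not real NIH pages.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.nih.gov/example-page</loc></url>
  <url><loc>https://www.nih.gov/another-example</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(to_csv(sitemap_rows("https://www.nih.gov/sitemap.xml", SAMPLE)), end="")
```

Running sitemap_rows once per sitemap and concatenating the results would give a single file in which the first column records which of the three sitemaps each URL came from.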