Adds FDA and NIH HTML URLs for seed list (#25)

* FDA HTML urls for seed list

This is a file of URLs derived from the FDA's sitemap.xml file. URLs of the form /media/*/download and the warning-letters URLs have been filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately.

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
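
The filtering described above could be sketched roughly as follows. This is only an illustration, not the script used to build the file: the regular expression, the "/warning-letters/" path marker, and the function names are all assumptions, since the commit message names the two excluded categories but not the exact matching rules.

```python
import re

# Assumed patterns: the commit message only names the two excluded categories.
DOWNLOAD_RE = re.compile(r"/media/[^/]+/download/?$")
WARNING_LETTERS_MARKER = "/warning-letters/"


def is_probably_html(url: str) -> bool:
    """Return True when the URL matches neither excluded category."""
    path = url.split("?", 1)[0]  # ignore any query string
    return not DOWNLOAD_RE.search(path) and WARNING_LETTERS_MARKER not in path


def filter_rows(rows):
    """Keep (sitemap, url) pairs whose URL passes is_probably_html."""
    return [(sitemap, url) for sitemap, url in rows if is_probably_html(url)]
```

Each kept pair corresponds to one CSV row in the format given above.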

* FDA warning letters from sitemap.xml

This is a file of URLs derived from the FDA's sitemap.xml file that point to warning-letter content. I expected these to be PDFs, but they turn out to be HTML.

The format of this CSV file is: sitemap file the URL is sourced from,the URL

* FDA download links from sitemaps

This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's pdfjs library -- so the "download" in the path is a little misleading.

The format of this CSV file is: sitemap file the URL is sourced from,the URL
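
As a rough sketch of how a consumer might read this file, the snippet below parses the two-column CSV and pulls the id out of each /media/id/download URL. The pattern and the `media_ids` helper are hypothetical, and nothing here assumes the id is numeric.

```python
import csv
import io
import re

# Assumed shape of the path; the commit message gives it as /media/id/download.
MEDIA_DOWNLOAD_RE = re.compile(r"/media/(?P<media_id>[^/]+)/download/?$")


def media_ids(csv_text: str):
    """Yield (media_id, url) for every row whose URL matches the pattern."""
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        _sitemap, url = row
        match = MEDIA_DOWNLOAD_RE.search(url)
        if match:
            yield match.group("media_id"), url
```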

* NIH urls from three sitemaps.

This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml. 

The format of this CSV file is: sitemap file the URL is sourced from,the URL
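
For illustration, rows in this format could be produced from a sitemap along these lines. This is a minimal sketch that assumes a plain urlset document in the standard sitemaps.org namespace (not a sitemap index) and takes the XML as a string, so fetching is left out. The function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_rows(sitemap_url: str, xml_text: str):
    """Return (sitemap_url, page_url) pairs for each <url><loc> entry."""
    root = ET.fromstring(xml_text)
    return [
        (sitemap_url, loc.text.strip())
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def rows_to_csv(rows) -> str:
    """Serialize the pairs as 'sitemap,url' lines."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Running `sitemap_to_rows` once per sitemap and concatenating the results would give a single list covering all three sources.
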
YakShaver 2025-01-02 09:33:40 -08:00 committed by GitHub
parent 526cb84dd0
commit a3d96841db
GPG Key ID: B5690EEEBB952194
4 changed files with 100935 additions and 0 deletions

File diffs suppressed because they are too large.

seed-lists/nih-urls.csv (normal file, 15044 lines added)