Commit Graph

59 Commits

Author SHA1 Message Date
YakShaver
a3d96841db
Adds FDA and NIH HTML URLs for seed list (#25)
* FDA HTML urls for seed list

This is a file of URLs derived from the FDA's sitemap.xml file. URLs of the form /media/*/download and the warning letters have been filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately.

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>

* FDA warning letters from sitemap.xml

This is a file of URLs derived from the FDA's sitemap.xml file with warning letters content. I thought this was going to be PDF content, but it turns out to be HTML content. 

The format of this CSV file is: sitemap file the URL is sourced from,the URL

* FDA download links from sitemaps

This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's pdfjs library, so the "download" in the path name is a little misleading.

The format of this CSV file is: sitemap file the URL is sourced from,the URL

* NIH urls from three sitemaps.

This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml. 

The format of this CSV file is: sitemap file the URL is sourced from,the URL
2025-01-02 11:33:40 -06:00
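The commit message above describes a two-column CSV (sitemap file, URL) and three groups of FDA URLs. A minimal sketch of how such rows could be sorted into those groups follows. The function name `bucket_fda_urls`, the regular expression, and the assumption that warning-letter URLs contain "warning-letters" in their path are illustrative and not taken from the repository.

```python
import csv
import io
import re
from urllib.parse import urlsplit

# Path shape described in the commit message: /media/<id>/download
MEDIA_DOWNLOAD = re.compile(r"^/media/[^/]+/download/?$")


def bucket_fda_urls(csv_text):
    """Group rows of '<sitemap>,<URL>' into html / warning-letters / download."""
    buckets = {"html": [], "warning-letters": [], "download": []}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        _sitemap, url = (field.strip() for field in row)
        path = urlsplit(url).path
        if MEDIA_DOWNLOAD.match(path):
            buckets["download"].append(url)
        elif "warning-letters" in path:  # assumption: marker appears in the path
            buckets["warning-letters"].append(url)
        else:
            buckets["html"].append(url)
    return buckets
```

A URL that itself contains an unquoted comma would not parse as two columns and is skipped by this sketch; whether the real files contain such rows is not stated in the commit message.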
Lauren Ko
526cb84dd0 Add State Department FOIA PDF seed lists to README 2025-01-02 11:26:40 -06:00
Antoine McGrath
9c4b1910fd
Add files via upload (#24)
Added State Department FOIA URLs
The 458,130 URLs included in the files USAStateFOIA_pdf_urls_part1.txt and USAStateFOIA_pdf_urls_part2.txt are derived from the US State Department's Freedom of Information Act (FOIA) Virtual Reading Room search database. The website serves thousands of Collections as Search Pages (2,940) and 455,190 endpoint PDFs.

2,940 Search Pages
Examples
http://foia.state.gov/Search/Results.aspx?collection=Clinton_Email_February_29_Release
http://foia.state.gov/Search/Results.aspx?collection=Litigation_F-2016-07895_6
https://foia.state.gov/Search/Results.aspx?caseNumber=F-1991-05139
https://foia.state.gov/Search/Results.aspx?IRIA.aspx
https://foia.state.gov/Search/Results.aspx?Microfiche.aspx


455,190 PDF URLs
Examples
https://foia.state.gov/DOCUMENTS/1-FY2012/F-2004-02207/DOC_0C17731327/C17731327.pdf
https://foia.state.gov/DOCUMENTS/FOIA_Micro_Aug2024_6/F-1986-01832/DOC_0C09000001/C09000001.pdf
https://foia.state.gov/DOCUMENTS/FOIA_Micro_Oct2024_7/F-1989-00718/DOC_0C09000006/C09000006.pdf
https://foia.state.gov/DOCUMENTS/Litigation/HRCLitigation_1/JW7 RD4 02-24-2014 - 1 of 7.sdhdhpdf_Part1.pdf
https://foia.state.gov/DOCUMENTS\Argentina\0000AFA2.pdf
2025-01-02 11:22:58 -06:00
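The split described above (2,940 search pages plus 455,190 PDFs, 458,130 URLs in total) could be checked with a short tally like the one below. This is only a sketch: the function name `count_foia_urls` is hypothetical, and it assumes one URL per line, that search pages use the /Search/Results.aspx path, and that PDF paths end in ".pdf", as in the examples listed in the commit message.

```python
from urllib.parse import urlsplit


def count_foia_urls(lines):
    """Tally search-page URLs and PDF URLs from an iterable of text lines."""
    counts = {"search": 0, "pdf": 0, "other": 0}
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        path = urlsplit(url).path.lower()
        if path == "/search/results.aspx":
            counts["search"] += 1
        elif path.endswith(".pdf"):
            counts["pdf"] += 1
        else:
            counts["other"] += 1
    return counts
```

URLs containing spaces or backslashes, like two of the examples above, are counted as-is; the sketch does not attempt to normalize or percent-encode them.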
Lauren Ko
3e95bc46a9 Tweak seed lists README 2025-01-02 11:02:08 -06:00
James R. Jacobs
59b04f37a8
Add 6 bulk lists (#23)
* Update README.md

2 new lists from EDGI and Hermann-Wu submitted via dot-info@archive.org

* Add files via upload

2 new lists from EDGI and Hermann-Wu submitted via dot-info@archive.org

* Update README.md

bulk list re USDA seeds submitted by AWI 20241222

* Add files via upload

bulk list re USDA seeds submitted by AWI 20241222

* Update README.md

3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx,  AWI-USDA-FSIS-20241222.xlsx,  NSF-20241224.xlsx

* Add files via upload

3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx,  AWI-USDA-FSIS-20241222.xlsx,  NSF-20241224.xlsx

* Update README.md

list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx

* Add files via upload

list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx
2025-01-02 10:56:39 -06:00
James R. Jacobs
06bfdd7bcd Add files via upload
Sustainability-gov-Hermann-Wu-20241220.xlsx

Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
2024-12-20 09:24:08 -06:00
James R. Jacobs
bf7cf89659 Update README.md
added bulk list Sustainability-gov-Hermann-Wu-20241220.xlsx
2024-12-20 09:19:42 -06:00
Lauren Ko
86a3364700 Add Defenders of Wildlife seeds 2024-12-19 15:30:47 -06:00
James R. Jacobs
0f565c94e4
bulk list submitted 20241219 by Ailsa Hermann-Wu (#21)
* Update README.md

bulk list on performance.gov by Ailsa Hermann-Wu

* Add files via upload

Bulk list submitted by Ailsa Hermann-Wu re performance.gov 20241219
2024-12-19 15:00:38 -06:00
James R. Jacobs
f1694a635c
bulk seed list of GAO seeds by Ailsa Hermann-Wu (#20)
* Update README.md

bulk seed list of GAO seeds sent by Ailsa Hermann-Wu

* Add files via upload

bulk seed list of GAO seeds sent by Ailsa Hermann-Wu
2024-12-18 10:14:24 -06:00
James R. Jacobs
ef3bd7d5f9
3 new bulk lists submitted by Gary Price (#19)
* Update README.md

added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx

* Add files via upload

USDA_FIS_ERS.xlsx from Gary Price/infodocket. About 1,700 URLs from the USDA, specifically the Food Inspection Service and Economic Research Service.

* Update README.md

3 more bulk lists from Gary Price sent on 12/14/2024

* Add files via upload

3 new bulk lists from Gary Price submitted 12/14/2024
2024-12-16 14:11:03 -06:00
James R. Jacobs
ed7cabab8e
another bulk seed list from Gary Price (USDA) (#18)
* Update README.md

added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx

* Add files via upload

USDA_FIS_ERS.xlsx from Gary Price/infodocket. About 1,700 URLs from the USDA, specifically the Food Inspection Service and Economic Research Service.
2024-12-12 15:38:05 -06:00
Lauren Ko
97a727fc4e Update README for sitemaps.txt and sitemap-url-seeds directory 2024-12-11 13:21:44 -06:00
Lauren Ko
94e610e8e1
Merge pull request #14 from TheBoatyMcBoatFace/main
* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Batch commit of sitemap URL seeds under 500MB or 250 files

* Forgot to add sitemaps.txt

Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>

---------

Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>
2024-12-11 12:58:16 -06:00
Bentley Hensel
bad24fe745
Forgot to add sitemaps.txt
Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>
2024-12-10 20:36:47 -05:00
Bentley Hensel
e2054da35c
Merge pull request #3 from end-of-term/main
update
2024-12-10 20:33:49 -05:00
Lauren Ko
3b3bf304b9 Update README for CDC PDFs 2024-12-10 14:55:01 -06:00
YakShavingAsAService
f4b194553a
PDFs from the CDC website - single file (#17)
This is a CSV file of PDF links obtained from webpages found on the US CDC website. It contains 46,873 links, with the format: the source HTML file containing the PDF link; the time in UTC at which the accessibility of the PDF file was confirmed; and a URL pointing to the PDF file itself.
    
This file replaces the two previous files. The PDF links have been deduplicated, so if multiple pages point to the same PDF, you'll only see an entry for the first reference. PDF links that point to non-gov domains have been omitted as well. If the PDF link contains a fragment, the fragment is removed from the path (e.g. "/a/path/mypdf.pdf#page=3" becomes "/a/path/mypdf.pdf"). All the PDF files have had their accessibility and content type verified with an HTTP HEAD request on Dec. 09 2024.
2024-12-10 14:51:36 -06:00
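The cleanup steps described in the commit message above (strip fragments, drop non-gov hosts, keep only the first reference to each PDF) could look roughly like the sketch below. It is not the submitter's actual script: the function name `clean_pdf_links` and the tuple layout are assumptions based on the three-field format described, and the HTTP HEAD verification step is left out.

```python
from urllib.parse import urldefrag, urlsplit


def clean_pdf_links(rows):
    """Clean (source_page, checked_at_utc, pdf_url) rows.

    Strips URL fragments, drops hosts outside .gov, and keeps only the
    first row that references each PDF URL.
    """
    seen = set()
    kept = []
    for source_page, checked_at, pdf_url in rows:
        url, _fragment = urldefrag(pdf_url)  # "x.pdf#page=3" -> "x.pdf"
        host = (urlsplit(url).hostname or "").lower()
        if not (host == "gov" or host.endswith(".gov")):
            continue  # omit links that point to non-gov domains
        if url in seen:
            continue  # a previous page already referenced this PDF
        seen.add(url)
        kept.append((source_page, checked_at, url))
    return kept
```

Stripping the fragment before the duplicate check means that "mypdf.pdf#page=3" and "mypdf.pdf" count as the same PDF, which matches the example given in the commit message.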
James R. Jacobs
5a9195431e
bulk list of NPS seeds submitted by Hermann-Wu - Hermann-Wu-nps-20241209.txt (#16)
* Update README.md

added NPS seeds submitted by Hermann-Wu - Hermann-Wu-nps-20241209.txt

* Add files via upload

NPS seeds submitted by Hermann-Wu - Hermann-Wu-nps-20241209.txt

* Update README.md

edited the contact section.
2024-12-10 12:48:14 -06:00
Bentley Hensel
4bef8b223d
Merge pull request #2 from end-of-term/main
Add some Bureau of Land Management and EnergyFundsForAll.org seeds (#15)
2024-12-09 15:59:46 -05:00
Lauren Ko
ed4d0f0d8a
Add some Bureau of Land Management and EnergyFundsForAll.org seeds (#15)
* Update README.md

* Update README.md

* Add files via upload

* Update README.md

added bulk file from EnergyFundsForAll.org

* Bulk list from EnergyFundsForAll

* Remove extra whitespace

Signed-off-by: Lauren Ko <lauren.ko@unt.edu>

* Remove duplicate listing of infodocket-11-21-2024.xls

---------

Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
Co-authored-by: James R. Jacobs <freegovinfo@gmail.com>
2024-12-09 14:28:43 -06:00
Bentley Hensel
7a74ece080
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:41:07 -05:00
Bentley Hensel
bd3fdbde47
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:38:34 -05:00
Bentley Hensel
49aee9c7bc
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:37:48 -05:00
Bentley Hensel
bf267e339e
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:37:09 -05:00
Bentley Hensel
c015b8b98d
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:35:50 -05:00
Bentley Hensel
980fa37e2a
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:33:46 -05:00
Bentley Hensel
4042707213
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:33:25 -05:00
Bentley Hensel
0535ad7cf2
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:32:01 -05:00
Bentley Hensel
73719faa91
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:29:05 -05:00
Bentley Hensel
4d70936a23
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:19:10 -05:00
Bentley Hensel
aeea7beac2
Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 17:56:32 -05:00
Lauren Ko
a6e38c7311 Add CDC .html seed list 2024-12-03 15:53:17 -06:00
Lauren Ko
47e8f8eb67 Add Bluesky URL
Co-authored-by: Melody Joy Kramer <melodykramer@gmail.com>
2024-12-03 12:34:32 -06:00
James R. Jacobs
d633f6965c
uploaded new bulk seed files from Gary Price and Kelly Smith (#11)
* adding info docket bulk seed list

* Update README.md

* Update README.md

* Add files via upload

Bulk lists from Gary Price and Kelly Smith. Seed list readme updated with file names.
2024-12-02 12:11:12 -06:00
Lauren Ko
4519cb1ee8 Add NLM seed list 2024-11-22 13:18:59 -06:00
Lauren Ko
37b32203c5 Add irs.gov seeds from Gary Price 2024-11-21 13:24:36 -06:00
James R. Jacobs
58e14710e3
pull requests for info docket bulk list 11-21-2024 (#5)
* adding info docket bulk seed list

* Update README.md
2024-11-21 12:56:50 -06:00
Lauren Ko
7e3d04ed8c Update README for bsky_gov_urlverified.txt 2024-11-21 08:54:52 -06:00
Antoine McGrath
01662e4c87
Create bsky_gov_urlverified.txt (#4)
URLs for official US Senate.gov and House.gov bluesky accounts
2024-11-21 08:49:00 -06:00
Lauren Ko
3a14a8fb3f Add seed list from EDGI 2024-11-14 09:54:06 -06:00
Lauren Ko
8e8c22e358 Add updated govspeak list 2024-11-08 09:36:35 -06:00
Lauren Ko
a325cf3f79 Add two lists supplied by James Jacobs 2024-10-25 16:48:38 -05:00
Lauren Ko
99460625a9 Add usagov.csv seed list 2024-09-23 11:10:35 -05:00
Greg Lindahl
ba124bec62
Common Crawl seeds (#3)
* Common Crawl Foundation seeds

* clean mil list to just hostnames

* doc: add location of ccf repo that generated these files

---------

Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
2024-09-16 09:33:58 -05:00
Lauren Ko
4392d90188 Add more files from web resources 2024-09-12 15:54:23 -05:00
Lauren Ko
b9dfb4f189 Add seeds from https://touchpoints.app.cloud.gov/registry 2024-09-12 12:19:09 -05:00
Lauren Ko
1b1b4736b4 Add NARA's 118th House Seeds 2024-09-09 16:24:56 -05:00
Lauren Ko
e49378d304 Adding seed lists from NARA and in-scope non gov/mil PURL target domain csv 2024-09-06 16:10:50 -05:00
Lauren Ko
a7cf90dd34 Add GovSpeak seeds 2024-08-01 11:59:00 -05:00