96 Commits

Author SHA1 Message Date
James R. Jacobs
b13db480ec
USIP bulk seed list (#57)
* Update README.md

USIP-20250319.xlsx. 4500 seeds from the US Institute of Peace, a Congressionally funded quasi-governmental org.

* Add files via upload

USIP-20250319.xlsx. 4500 seeds from the US Institute of Peace, a Congressionally funded quasi-governmental org.
2025-03-20 17:15:53 -05:00
Lauren Ko
0abe61ed95 Give details about EO 2025-03-17 16:05:44 -05:00
James R. Jacobs
0db7305f40
5 new bulk lists from Gary Price (#56)
* Update README.md

5 new bulk lists from Gary Price

* Add files via upload

5 new bulk lists from Gary Price
2025-03-17 15:51:11 -05:00
James R. Jacobs
f1ee02416b
Added Infodocket bulk list StudentAidDOTgov-20250314.xlsx. 3300 seeds from StudentAid.gov. (#55)
Added Infodocket bulk list StudentAidDOTgov-20250314.xlsx. 3300 seeds from StudentAid.gov.

* Add files via upload

Added Infodocket bulk list StudentAidDOTgov-20250314.xlsx. 3300 seeds from StudentAid.gov.

* Update README.md

another bulk list from Gary Price. DOD PDF-20250314.xlsx. 411 DOD pdfs (and a few other formats).

* Add files via upload

DOD PDF-20250314.xlsx. 411 DOD pdfs (and a few other formats).
2025-03-17 12:48:39 -05:00
James R. Jacobs
a0ad05f63a
bulk list by James Jacobs (#54)
* Update README.md

bulk list submitted by James Jacobs. NOAA voices oral histories. NOAA-Voices-oral-history-archive-20250311.xlsx

* Add files via upload

bulk list submitted by James Jacobs. NOAA voices oral histories. NOAA-Voices-oral-history-archive-20250311.xlsx
2025-03-12 09:43:04 -05:00
James R. Jacobs
00979bc355
2 new bulk lists from Gary Price: DOD_photos_20250310.xlsx and NASA-SCIENCE-20250310.xlsx. (#53)
* Update README.md

2 new bulk lists from Gary Price: DOD_photos_20250310.xlsx and NASA-SCIENCE-20250310.xlsx.

* Add files via upload

2 new bulk lists from Gary Price: DOD_photos_20250310.xlsx and NASA-SCIENCE-20250310.xlsx.
2025-03-11 10:51:00 -05:00
James R. Jacobs
5a1403f9d5
bulk list added by info docket NIH_record-20250307.xlsx. (#52)
* Update README.md

bulk list added by info docket NIH_record-20250307.xlsx.

* Add files via upload

bulk list added by info docket NIH_record-20250307.xlsx.
2025-03-07 17:57:10 -06:00
James R. Jacobs
ced506d749 Add files via upload
Infodocket bulk list NSF_policy-20250227.xlsx
2025-03-05 10:02:18 -06:00
James R. Jacobs
482c7a59dd Update README.md
Infodocket bulk list NSF_policy-20250227.xlsx
2025-03-05 10:02:18 -06:00
James R. Jacobs
6a27257683
Infodocket bulk list of Ukraine-US-embassy (#50)
* Update README.md

Infodocket bulk list of Ukraine-US-embassy

* Add files via upload

Infodocket bulk list of Ukraine-US-embassy
2025-02-25 11:51:46 -06:00
James R. Jacobs
c2907ffc76
2 new bulk lists from infodocket. (#49)
* Update README.md

2 new bulk lists from infodocket.

* Add files via upload

2 new bulk lists from infodocket.
2025-02-24 10:13:35 -06:00
James R. Jacobs
9568d659f5
Infodocket Bulk list HUD_comm_development-20250220.xlsx. (#48)
* Update README.md

Infodocket Bulk list HUD_comm_development-20250220.xlsx.

* Add files via upload

Infodocket Bulk list HUD_comm_development-20250220.xlsx.
2025-02-20 11:10:19 -06:00
James R. Jacobs
a5acf47b56
Infodocket bulk list NNSA-20250218.xlsx. (#47)
* Update README.md

Infodocket bulk list NNSA-20250218.xlsx.

* Add files via upload

Infodocket bulk list NNSA-20250218.xlsx.
2025-02-19 09:50:43 -06:00
James R. Jacobs
98d5d13ac6
Federal exec institute bulk list from Gary (#46)
* Update README.md

FEI bulk list from Gary Price

* Add files via upload

FEI bulk list from Gary Price
2025-02-13 11:17:31 -06:00
James R. Jacobs
1bc4dd420f
Infodocket Bulk seed list from NCES (#45)
* Update README.md

Bulk seed list from NCES

* Add files via upload

Bulk seed list from NCES
2025-02-12 16:26:16 -06:00
James R. Jacobs
225792c840
Infodocket bulk list seeds from the Office of Gov Ethics. (#44)
* Update README.md

Infodocket bulk list seeds from the Office of Gov Ethics.

* Add files via upload

Infodocket bulk list seeds from the Office of Gov Ethics.
2025-02-11 09:23:04 -06:00
James R. Jacobs
7f01ebaf84
2 new bulk lists from Infodocket re .mil seeds (#43)
* Update README.md

2 new bulk lists from Infodocket re .mil seeds

* Add files via upload

2 new bulk lists from Infodocket re .mil seeds
2025-02-10 12:52:23 -06:00
James R. Jacobs
81d9e8f745
added bulk list from data rescue 2025 inventories (#42)
* Update README.md

added bulk list from data rescue 2025 inventories

* Add files via upload

added bulk list from data rescue 2025 inventories
2025-02-10 09:48:12 -06:00
James R. Jacobs
4f6e19c4fb
info docket bulk list re transgender PDFs. (#41)
* Update README.md

info docket bulk list re transgender PDFs.

* Add files via upload

info docket bulk list re transgender PDFs.
2025-02-06 09:35:52 -06:00
James R. Jacobs
c8eac740f3
2 new bulk lists from Infodocket (#40)
* Update README.md

2 new bulk lists from Infodocket

* Add files via upload

2 new bulk lists from Infodocket
2025-02-05 17:15:03 -06:00
James R. Jacobs
f0c3e93273
data urls bulk list from Infodocket (#39)
* Update README.md

data urls bulk list from Infodocket

* Add files via upload

data urls bulk list from Infodocket
2025-02-03 16:02:14 -06:00
James R. Jacobs
5f0ab60be1
4 new bulk lists from Gary Price @ InfoDocket (#37)
* Update README.md

4 new bulk lists from Gary Price @ Infodocket.

* Add files via upload

4 new bulk lists from Gary Price.

* Update README.md

bulk list from LiL of data.gov catalog records

* Add files via upload

bulk list from Harvard LiL
2025-02-03 11:52:45 -06:00
James R. Jacobs
4ba7aa4008
Gary price bulk list for MSPB (#35)
* Update README.md

added bulk list from Infodocket.

* Add files via upload

new list from Gary Price re MSPB.
2025-01-29 09:35:12 -06:00
James R. Jacobs
52d684196b
2 new bulk lists added (#33)
* Update README.md

added 2 new bulk lists to readme from Gary Price/Infodocket and Campaign for Tobacco-Free Kids

* Add files via upload

added 2 new bulk lists from Gary Price/Infodocket and Campaign for Tobacco-Free Kids
2025-01-23 13:19:10 -06:00
Lauren Ko
c5f1d52ae1 Updating README 2025-01-20 09:56:48 -06:00
Nintendofan885
563dec5a68
Download links for data.cdc.gov (#32)
extracted from the URLs in #26
2025-01-20 09:53:49 -06:00
James R. Jacobs
21fafd407a
another Gary Price bulk list (#31)
* Update README.md

added MIT spreadsheet

* Add files via upload

added MIT spreadsheet

* Update README.md

new bulk list by EdTrust

* Add files via upload

new bulk list by EdTrust

* Update README.md

another Gary Price bulk list

* Add files via upload

another Gary Price bulk list
2025-01-20 09:52:53 -06:00
Lauren Ko
42262a8a6c Add another EDGI bulk list 2025-01-17 13:58:52 -06:00
Lauren Ko
4683dd7a8f Adding gov_sitemaps.csv to README 2025-01-17 13:56:10 -06:00
Antoine McGrath
2cb181958e
Create gov_sitemaps.csv (#7)
* Create gov_sitemaps.csv

Gov_Domain Sitemap URLs and the contents of those sitemaps

* Update gov_sitemaps.csv

These URLs are a compilation of various gov sitemaps. The references to non-government content and standards have been removed.
URLs found in gov sitemaps (including ones that link to gov social media profiles) remain.
2025-01-17 13:52:08 -06:00
James R. Jacobs
d851fb5e78
added MIT spreadsheet (#30)
* Update README.md

added MIT spreadsheet

* Add files via upload

added MIT spreadsheet

* Update README.md

new bulk list by EdTrust

* Add files via upload

new bulk list by EdTrust
2025-01-17 13:50:04 -06:00
James R. Jacobs
4a9c7e4b45
3 new bulk lists (#29)
* Update README.md

3 more bulk lists added to the readme

* Add files via upload

3 new bulk seed lists submitted to eot-info@archive.org
2025-01-14 13:15:01 -06:00
James R. Jacobs
f8b8df2490
new seed list submitted by Power To Decide (#28)
* Update README.md

Seed list submitted by Power To Decide

* Add files via upload

new seed list submitted by Power To Decide
2025-01-10 17:11:44 -06:00
James R. Jacobs
ed8e0e2a3e
7 new bulk lists from Infodocket and eot-info submissions. (#27)
* Update README.md

Several new bulk lists added from Infodocket and eot-info submissions.

* Add files via upload

7 new bulk lists from Infodocket and eot-info submissions.
2025-01-08 17:17:15 -06:00
Lauren Ko
c35eb3240e Add cdc-dataset-urls.csv to README 2025-01-06 10:07:15 -06:00
YakShaver
5a83e824d4
cdc datasets urls from sitemap (#26)
CDC dataset URLs derived from this sitemap file: https://s3.amazonaws.com/sa-socrata-sitemaps-us-east-1-fedramp-prod/sitemaps/sitemap-datasets-data.cdc.gov0.xml. The format of this CSV file is: sitemap file the URL is sourced from,the URL.

Note that the pages pointed to by these URLs usually include a download button to get a CSV of the dataset, but the CSVs themselves aren't included in this seed file and won't be retrieved in the crawl. But at least the existence of the dataset and links to its metadata will be documented in the archive.
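
The CSV layout described above (sitemap file, then URL) can be sketched as follows. This is a minimal illustration, not the script used to build cdc-dataset-urls.csv: the helper names, the inline sample XML, and the example.gov URLs are all assumptions, and the sketch assumes the sitemap uses the standard sitemaps.org namespace.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed to apply here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_to_rows(sitemap_url, sitemap_xml):
    """Return (sitemap_url, page_url) pairs for every <loc> in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        (sitemap_url, loc.text.strip())
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]


def rows_to_csv(rows):
    """Serialize rows as 'sitemap file,URL' lines."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical sample input; real sitemaps would be fetched separately.
SAMPLE_SITEMAP_URL = "https://www.example.gov/sitemap.xml"
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://data.example.gov/d/abcd-1234</loc></url>
  <url><loc>https://data.example.gov/d/efgh-5678</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(rows_to_csv(sitemap_to_rows(SAMPLE_SITEMAP_URL, SAMPLE_XML)), end="")
```

Keeping the source sitemap in the first column makes it possible to trace each seed back to where it was found.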
2025-01-06 10:05:19 -06:00
Lauren Ko
ee6cba5868 Update README for FDA and NIH sitemap seed lists 2025-01-02 12:32:17 -06:00
YakShaver
a3d96841db
Adds FDA and NIH HTML URLs for seed list (#25)
* FDA HTML urls for seed list

This is a file of URLs derived from the FDA's sitemap.xml file. It has URLs of the form /media/*/download and the warning-letters filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately.

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>

* FDA warning letters from sitemap.xml

This is a file of URLs derived from the FDA's sitemap.xml file with warning letters content. I thought this was going to be PDF content, but it turns out to be HTML content. 

The format of this CSV file is: sitemap file the URL is sourced from,the URL

* FDA download links from sitemaps

This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's pdfjs library -- so the download in the path name is a little misleading.

The format of this CSV file is: sitemap file the URL is sourced from,the URL

* NIH urls from three sitemaps.

This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml. 

The format of this CSV file is: sitemap file the URL is sourced from,the URL
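
The split of FDA sitemap URLs described above (media download links, warning letters, and everything else) could be approximated like this. It is only a sketch under assumptions: the function name, the bucket names, and the sample example.gov URLs are hypothetical, and matching warning letters by a "/warning-letters/" path segment is a guess at how the filtering was done.

```python
import re
from urllib.parse import urlsplit

# Paths of the form /media/<id>/download, as described in the commit message.
MEDIA_DOWNLOAD = re.compile(r"^/media/[^/]+/download/?$")
# Assumption: warning-letter pages contain this segment in their path.
WARNING_LETTERS = "/warning-letters/"


def partition_urls(urls):
    """Sort URLs into media_download, warning_letters, and html buckets."""
    buckets = {"html": [], "warning_letters": [], "media_download": []}
    for url in urls:
        path = urlsplit(url).path
        if MEDIA_DOWNLOAD.match(path):
            buckets["media_download"].append(url)
        elif WARNING_LETTERS in path:
            buckets["warning_letters"].append(url)
        else:
            buckets["html"].append(url)
    return buckets


# Hypothetical sample URLs, not taken from the actual seed files.
SAMPLE_URLS = [
    "https://www.example.gov/media/12345/download",
    "https://www.example.gov/enforcement/warning-letters/sample-company",
    "https://www.example.gov/about/what-we-do",
]

if __name__ == "__main__":
    for name, members in partition_urls(SAMPLE_URLS).items():
        print(name, len(members))
```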
2025-01-02 11:33:40 -06:00
Lauren Ko
526cb84dd0 Add State Department FOIA PDF seed lists to README 2025-01-02 11:26:40 -06:00
Antoine McGrath
9c4b1910fd
Add files via upload (#24)
Added State Department FOIA URLs
The 458,130 URLs included in the files USAStateFOIA_pdf_urls_part1.txt and USAStateFOIA_pdf_urls_part2.txt are derived from the US State Department's Freedom of Information Act (FOIA) Virtual Reading Room search database. The website serves thousands of Collections as Search Pages (2,940) and 455,190 endpoint PDFs.

2,940 Search Pages
Examples
http://foia.state.gov/Search/Results.aspx?collection=Clinton_Email_February_29_Release
http://foia.state.gov/Search/Results.aspx?collection=Litigation_F-2016-07895_6
https://foia.state.gov/Search/Results.aspx?caseNumber=F-1991-05139
https://foia.state.gov/Search/Results.aspx?IRIA.aspx
https://foia.state.gov/Search/Results.aspx?Microfiche.aspx


455,190 PDF URLs
Examples
https://foia.state.gov/DOCUMENTS/1-FY2012/F-2004-02207/DOC_0C17731327/C17731327.pdf
https://foia.state.gov/DOCUMENTS/FOIA_Micro_Aug2024_6/F-1986-01832/DOC_0C09000001/C09000001.pdf
https://foia.state.gov/DOCUMENTS/FOIA_Micro_Oct2024_7/F-1989-00718/DOC_0C09000006/C09000006.pdf
https://foia.state.gov/DOCUMENTS/Litigation/HRCLitigation_1/JW7 RD4 02-24-2014 - 1 of 7.sdhdhpdf_Part1.pdf
https://foia.state.gov/DOCUMENTS\Argentina\0000AFA2.pdf
2025-01-02 11:22:58 -06:00
Lauren Ko
3e95bc46a9 Tweak seed lists README 2025-01-02 11:02:08 -06:00
James R. Jacobs
59b04f37a8
Add 6 bulk lists (#23)
* Update README.md

2 new lists from EDGI and Hermann-Wu submitted via eot-info@archive.org

* Add files via upload

2 new lists from EDGI and Hermann-Wu submitted via eot-info@archive.org

* Update README.md

bulk list re USDA seeds submitted by AWI 20241222

* Add files via upload

bulk list re USDA seeds submitted by AWI 20241222

* Update README.md

3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx,  AWI-USDA-FSIS-20241222.xlsx,  NSF-20241224.xlsx

* Add files via upload

3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx,  AWI-USDA-FSIS-20241222.xlsx,  NSF-20241224.xlsx

* Update README.md

list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx

* Add files via upload

list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx
2025-01-02 10:56:39 -06:00
James R. Jacobs
06bfdd7bcd Add files via upload
Sustainability-gov-Hermann-Wu-20241220.xlsx

Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
2024-12-20 09:24:08 -06:00
James R. Jacobs
bf7cf89659 Update README.md
added bulk list Sustainability-gov-Hermann-Wu-20241220.xlsx
2024-12-20 09:19:42 -06:00
Lauren Ko
86a3364700 Add Defenders of Wildlife seeds 2024-12-19 15:30:47 -06:00
James R. Jacobs
0f565c94e4
bulk list submitted 20241219 by Ailsa Hermann-Wu (#21)
* Update README.md

bulk list on performance.gov by Ailsa Hermann-Wu

* Add files via upload

Bulk list submitted by Ailsa Hermann-Wu re performance.gov 20241219
2024-12-19 15:00:38 -06:00
James R. Jacobs
f1694a635c
bulk seed list of GAO seeds by Ailsa Hermann-Wu (#20)
* Update README.md

bulk seed list of GAO seeds sent by Ailsa Hermann-Wu

* Add files via upload

bulk seed list of GAO seeds sent by Ailsa Hermann-Wu
2024-12-18 10:14:24 -06:00
James R. Jacobs
ef3bd7d5f9
3 new bulk lists submitted by Gary Price (#19)
* Update README.md

added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx

* Add files via upload

USDA_FIS_ERS.xlsx from Gary Price/infodocket. 1700 or so urls from the USDA. Specifically, the Food Inspection Service and Economic Research Service.

* Update README.md

3 more bulk lists from Gary Price sent on 12/14/2024

* Add files via upload

3 new bulk lists from Gary Price submitted 12/14/2024
2024-12-16 14:11:03 -06:00
James R. Jacobs
ed7cabab8e
another bulk seed list from Gary Price (USDA) (#18)
* Update README.md

added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx

* Add files via upload

USDA_FIS_ERS.xlsx from Gary Price/infodocket. 1700 or so urls from the USDA. Specifically, the Food Inspection Service and Economic Research Service.
2024-12-12 15:38:05 -06:00
Lauren Ko
97a727fc4e Update README for sitemaps.txt and sitemap-url-seeds directory 2024-12-11 13:21:44 -06:00