eot2024/seed-lists
James R. Jacobs 06bfdd7bcd Add files via upload
Sustainability-gov-Hermann-Wu-20241220.xlsx

Signed-off-by: Lauren Ko <lauren.ko@unt.edu>
2024-12-20 09:24:08 -06:00
sitemap-url-seeds Batch commit of sitemap URL seeds under 500MB or 250 files 2024-12-05 18:41:07 -05:00
2_govt_urls_federal_only.csv Add more files from web resources 2024-09-12 15:54:23 -05:00
117th_House_Seeds.xlsx Adding seed lists from NARA and in-scope non gov/mil PURL target domain csv 2024-09-06 16:10:50 -05:00
117th_Senate_Seeds.xlsx Adding seed lists from NARA and in-scope non gov/mil PURL target domain csv 2024-09-06 16:10:50 -05:00
118th_House_Seed_List.xlsx Add NARA's 118th House Seeds 2024-09-09 16:24:56 -05:00
118th_Senate.xlsx Adding seed lists from NARA and in-scope non gov/mil PURL target domain csv 2024-09-06 16:10:50 -05:00
ARPA-H.xlsx 3 new bulk lists submitted by Gary Price (#19) 2024-12-16 14:11:03 -06:00
BLM 2020-2024.xlsx Add some Bureau of Land Management and EnergyFundsForAll.org seeds (#15) 2024-12-09 14:28:43 -06:00
bsky_gov_urlverified.txt Create bsky_gov_urlverified.txt (#4) 2024-11-21 08:49:00 -06:00
ccf-gov-federal-web-graph-2024-jun-jul-aug.txt Common Crawl seeds (#3) 2024-09-16 09:33:58 -05:00
ccf-mil-web-graph-2024-jun-jul-aug.txt Common Crawl seeds (#3) 2024-09-16 09:33:58 -05:00
CDC found PDFs 20241209 cleaned single file.csv PDFs from the CDC website - single file (#17) 2024-12-10 14:51:36 -06:00
CDC html URLs from sitemap data - 20241201.csv Add CDC .html seed list 2024-12-03 15:53:17 -06:00
CRS_ReportsList.csv Nominated URLs for CRS Reports 2024-05-29 21:05:02 -05:00
current-federal.csv Add more files from web resources 2024-09-12 15:54:23 -05:00
DNI and CFPB.xlsx uploaded new bulk seed files from Gary Price and Kelly Smith (#11) 2024-12-02 12:11:12 -06:00
DOJ, White_House, DEA, ATF, FBI.xlsx uploaded new bulk seed files from Gary Price and Kelly Smith (#11) 2024-12-02 12:11:12 -06:00
dotmil_websites.csv Add more files from web resources 2024-09-12 15:54:23 -05:00
EoT archive submission - DoW 12-19-24.txt Add Defenders of Wildlife seeds 2024-12-19 15:30:47 -06:00
eot_lgbtqandmisc.txt uploaded new bulk seed files from Gary Price and Kelly Smith (#11) 2024-12-02 12:11:12 -06:00
FDA_letters_releases_approvals.xlsx uploaded new bulk seed files from Gary Price and Kelly Smith (#11) 2024-12-02 12:11:12 -06:00
FDLP_WEb_Archiveseed_list_20240212.csv Add seed lists from GPO 2024-02-16 14:13:29 -06:00
Federal URLs linked to on EnergyFundsForAll.org.xlsx Add some Bureau of Land Management and EnergyFundsForAll.org seeds (#15) 2024-12-09 14:28:43 -06:00
FOIA_Libraries_Dataset_Oct_3_2023_Final.xlsx Add spreadsheet for James Jacobs 2024-05-08 16:37:50 -05:00
GAO-hermann-wu-20241218.xlsx bulk seed list of GAO seeds by Ailsa Hermann-Wu (#20) 2024-12-18 10:14:24 -06:00
govdoc-l-seeds-2024.txt Add two lists supplied by James Jacobs 2024-10-25 16:48:38 -05:00
govspeakeot080124.xlsx Add GovSpeak seeds 2024-08-01 11:59:00 -05:00
govspeakurls1124.txt Add updated govspeak list 2024-11-08 09:36:35 -06:00
Hermann-Wu-nps-20241209.txt bulk list of NPS seeds submitted by Hermann-Wu - Hermann-Wu-nps-20241209.txt (#16) 2024-12-10 12:48:14 -06:00
HHS 2020-.xlsx uploaded new bulk seed files from Gary Price and Kelly Smith (#11) 2024-12-02 12:11:12 -06:00
HRSA (2020-).xlsx uploaded new bulk seed files from Gary Price and Kelly Smith (#11) 2024-12-02 12:11:12 -06:00
IARPA 2020-Present.xlsx 3 new bulk lists submitted by Gary Price (#19) 2024-12-16 14:11:03 -06:00
infodocket-11-21-2024.xlsx pull requests for info docket bulk list 11-21-2024 (#5) 2024-11-21 12:56:50 -06:00
irs_documents.xlsx Add irs.gov seeds from Gary Price 2024-11-21 13:24:36 -06:00
LOC-seeds-for-eot-20240712.xlsx Add Library of Congress bulk seed list 2024-08-01 09:50:06 -05:00
MEDICAID 2020-2024.xlsx 3 new bulk lists submitted by Gary Price (#19) 2024-12-16 14:11:03 -06:00
Military-Departments-A-Z-List.csv Add more files from web resources 2024-09-12 15:54:23 -05:00
NLMRecommendationsEOT2024.xlsx Add NLM seed list 2024-11-22 13:18:59 -06:00
Performance.gov-equity-hermann-wu-20241219.xlsx bulk list submitted 20241219 by Ailsa Hermann-Wu (#21) 2024-12-19 15:00:38 -06:00
PURL_server_domains_20240214_non_gov_mil.csv Adding seed lists from NARA and in-scope non gov/mil PURL target domain csv 2024-09-06 16:10:50 -05:00
PURL_server_domains_20240214.csv Add seed lists from GPO 2024-02-16 14:13:29 -06:00
README.md Update README.md 2024-12-20 09:19:42 -06:00
RoundupListsforEOT.txt Add two lists supplied by James Jacobs 2024-10-25 16:48:38 -05:00
site-scanning-target-url-list.csv Add more files from web resources 2024-09-12 15:54:23 -05:00
sitemaps.txt Forgot to add sitemaps.txt 2024-12-10 20:36:47 -05:00
Sustainability-gov-Hermann-Wu-20241220.xlsx Add files via upload 2024-12-20 09:24:08 -06:00
US_Digital_Registry.csv Add seeds from https://touchpoints.app.cloud.gov/registry 2024-09-12 12:19:09 -05:00
us-government-website-directory.csv Add more files from web resources 2024-09-12 15:54:23 -05:00
usagov.csv Add usagov.csv seed list 2024-09-23 11:10:35 -05:00
USDA_FIS_ERS.xlsx another bulk seed list from Gary Price (USDA) (#18) 2024-12-12 15:38:05 -06:00
Violation_Tracker_unique_infosource_URLs.csv Add seed list from EDGI 2024-11-14 09:54:06 -06:00

End of Term 2024 Seed Lists

Posted here are seed lists used in the 2024 End of Term Web Archive project. Provenance notes are included below. These lists will be uploaded into the End of Term Bulk Nomination Tool.

Common Crawl Foundation seeds

See commoncrawl/ccf-eot-seeds-2024 for details.

  • ccf-gov-federal-web-graph-2024-jun-jul-aug.txt - all .gov federal hostnames from current-federal.csv domains in CCF's 2024 June/July/August web graph
  • ccf-mil-web-graph-2024-jun-jul-aug.txt - all .mil hostnames from CCF's 2024 June/July/August web graph
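The federal .gov selection described above can be sketched roughly as follows. This is a minimal illustration, not the Common Crawl Foundation's actual procedure (see commoncrawl/ccf-eot-seeds-2024 for that). It assumes hostnames in ordinary dotted form, and the function names and sample values are hypothetical.

```python
def hostname_in_domains(hostname, domains):
    """Return True if hostname equals, or is a subdomain of, a listed domain."""
    labels = hostname.lower().rstrip(".").split(".")
    # Test every suffix: a.b.example.gov, b.example.gov, example.gov, gov
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


def filter_hostnames(hostnames, domains):
    """Keep only the hostnames that fall under one of the given domains."""
    normalized = {d.lower().strip() for d in domains}
    return [h for h in hostnames if hostname_in_domains(h, normalized)]


# Made-up sample values standing in for current-federal.csv domains
# and for hostnames taken from a web graph (assumption, not real data).
federal_domains = ["nasa.gov", "cdc.gov"]
graph_hostnames = ["www.nasa.gov", "data.cdc.gov", "example.com",
                   "notnasa.gov", "cdc.gov"]

selected = filter_hostnames(graph_hostnames, federal_domains)
print(selected)
```

Matching on whole dot-separated labels, rather than on a plain string suffix, keeps a hostname such as `notnasa.gov` from being mistaken for a subdomain of `nasa.gov`.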

Defenders of Wildlife seeds

Seeds submitted by Andrew Carter on behalf of Defenders of Wildlife:

  • EoT archive submission - DoW 12-19-24.txt

Environmental Data & Governance Initiative (EDGI) seeds

Seeds supplied by Gretchen Gehrke of EDGI:

  • Violation_Tracker_unique_infosource_URLs.csv - list of seeds supplied by an EDGI collaborator.

U.S. Government Publishing Office (GPO) seeds

Seeds supplied by Dorothy Bower of the U.S. Government Publishing Office:

  • FDLP_WEb_Archiveseed_list_20240212.csv - list of seeds from the FDLP Web Archive, with one-page-only seeds (mainly embedded YouTube videos) deleted.
  • PURL_server_domains_20240214.csv - report of all target domains from the PURL server; some determined to be out of scope were not included in the Nomination Tool.
    • PURL_server_domains_20240214_non_gov_mil.csv - non .gov/.mil seeds from the PURL_server_domains_20240214.csv list that were determined to be in scope by Mark Phillips of UNT.
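As a rough illustration of the first, mechanical step behind the non .gov/.mil list, the sketch below separates domains by top-level domain. The in-scope decision itself was a manual review and is not reproduced here. The function names and sample values are hypothetical.

```python
def is_gov_or_mil(domain):
    """Return True if the domain sits under the .gov or .mil top-level domain."""
    d = domain.strip().lower().rstrip(".")
    return d.endswith((".gov", ".mil"))


def split_by_tld(domains):
    """Split domains into (.gov/.mil, everything else), preserving order."""
    gov_mil, other = [], []
    for d in domains:
        (gov_mil if is_gov_or_mil(d) else other).append(d)
    return gov_mil, other


# Made-up sample values standing in for PURL target domains (assumption).
sample = ["www.gpo.gov", "www.army.mil", "www.example.org", "example.edu"]
gov_mil, other = split_by_tld(sample)
print(other)  # candidates that would still need a manual scope review
```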

infoDOCKET seeds

Seed lists produced by Gary Price, editor of infoDOCKET:

  • infodocket-11-21-2024.xlsx - from Gary Price.
  • irs_documents.xlsx - list of irs.gov document seeds.
  • DNI and CFPB.xlsx
  • DOJ, White_House, DEA, ATF, FBI.xlsx
  • FDA_letters_releases_approvals.xlsx
  • HHS 2020-.xlsx
  • HRSA (2020-).xlsx
  • BLM 2020-2024.xlsx - 2544 entries from the Bureau of Land Management. Most, but not all, are PDFs. In addition to the usual techniques, a number of extra searches were done to find documents that include terms like ANWR, oil, fracking, etc.
  • USDA_FIS_ERS.xlsx - 1700 or so URLs from the USDA, specifically the Food Inspection Service and Economic Research Service. A few XLSX URLs too.
  • IARPA 2020-Present.xlsx - IARPA.gov 406 seeds HTML and PDF 2020-Present.
  • ARPA-H.xlsx - ARPA-H.gov 412 seeds HTML and PDF 2020-Present.
  • MEDICAID 2020-2024.xlsx - Medicaid.gov 1983 seeds PDF and a few XLSX 2020-Present.

Internet Archive seeds

Seeds supplied by Antoine McGrath of Internet Archive:

  • bsky_gov_urlverified.txt - URLs for official US Senate.gov and House.gov Bluesky accounts.
  • CRS_ReportsList.csv - nominated URLs for government-hosted CRS Reports from Daniel Schuman of the American Governance Institute.

Library of Congress seeds

  • LOC-seeds-for-eot-20240712.xlsx

National Archives and Records Administration seeds

Seeds supplied by Elizabeth England of the U.S. National Archives and Records Administration (NARA):

  • 117th_House_Seeds.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
  • 117th_Senate_Seeds.xlsx
  • 118th_House_Seed_List.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
  • 118th_Senate.xlsx

National Library of Medicine

Seeds supplied by Christie Moffatt of the National Library of Medicine:

  • NLMRecommendationsEOT2024.xlsx - highest priority are federal seeds recommended for NLM's Sexual and Gender Minority Health web archive, which have been identified by NIH's Sexual and Gender Minority Research Office.

Stanford seeds

Seeds supplied by James Jacobs of Stanford University Libraries:

  • FOIA_Libraries_Dataset_Oct_3_2023_Final.xlsx - spreadsheet with seeds for all of the federal FOIA libraries. Lisa DeLuca, who collated the list, said it would be fine to use her spreadsheet from https://works.bepress.com/lisa_deluca/59/.
  • govdoc-l-seeds-2024.txt - seeds from documents/sites recommended on the govdoc-l Listserv 2020 - 2024.

University of California San Diego seeds

Seeds supplied by Kelly L. Smith, Government Information Librarian and Librarian for Urban Studies & Planning / Environmental Studies at UC San Diego Library (via James Jacobs):

  • govspeakeot080124.xlsx - list of all the live URLs from Smith's GovSpeak acronym and abbreviation guide.
  • RoundupListsforEOT.txt
  • govspeakurls1124.txt - updated list of GovSpeak links, with about 250 new items added since the August list; CDC, ED, and a couple of other agencies have also significantly reorganized their websites since then.
  • eot_lgbtqandmisc.txt - 4300+ URLs for the EOT project. Most were identified by the small group working on LGBTQ+ pages; others come from Smith's LibGuide pages: the Roe v. Wade links, many of the stats/data sites from the Data Is Plural federal list, and weekly roundups.

Seeds submitted to eot-info@archive.org

  • Federal URLs linked to on EnergyFundsForAll.org.xlsx - Submitted by Sally Robertson, EnergyFundsForAll.org
  • Hermann-Wu-nps-20241209.txt - NPS seeds submitted by Ailsa Hermann-Wu
  • GAO-hermann-wu-20241218.xlsx - GAO seeds submitted by Ailsa Hermann-Wu
  • Performance.gov-equity-hermann-wu-20241219.xlsx - seeds submitted by Ailsa Hermann-Wu on 20241219 centered around Performance.gov -- these are all PDFs of agency equity action plans or AANHPI plans.
  • Sustainability-gov-Hermann-Wu-20241220.xlsx - spreadsheet of PDF links (climate/sustainability plans and scorecards) from Sustainability.gov, excluding only the ones that are already listed in the URL Nomination Tool.

Seeds sourced from Web resources

The End of Term Web Archive team and other contributors compiled a list of sources on the Web from which to source seeds: