# End of Term 2024 Seed Lists
Posted here are seed lists used in the 2024 End of Term Web Archive project.
Provenance notes are included below. These lists will be uploaded into the
[End of Term Bulk Nomination Tool](https://digital2.library.unt.edu/nomination/eth2024_bulk/).
### Common Crawl Foundation seeds
See [commoncrawl/ccf-eot-seeds-2024](https://github.com/commoncrawl/ccf-eot-seeds-2024) for details.
* ccf-gov-federal-web-graph-2024-jun-jul-aug.txt -- all .gov federal hostnames from current-federal.csv domains in CCF's 2024 June/July/August web graph
* ccf-mil-web-graph-2024-jun-jul-aug.txt -- all .mil hostnames from CCF's 2024 June/July/August web graph
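The hostname selection described for the federal .gov list can be sketched roughly as follows. This is only an illustration under stated assumptions, not Common Crawl's actual pipeline (see the linked repository for that): the column name `Domain name`, the helper names `load_domains` and `filter_hostnames`, and the sample data are all hypothetical.

```python
import csv
import io


def load_domains(csv_text, column="Domain name"):
    """Return the lowercased domains found in one column of a CSV document.

    The column name is an assumption; adjust it to match the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip().lower() for row in reader if row.get(column)}


def filter_hostnames(hostnames, domains):
    """Keep hostnames that equal a listed domain or are a subdomain of one."""
    kept = []
    for host in hostnames:
        host = host.strip().lower().rstrip(".")
        if not host:
            continue
        labels = host.split(".")
        # Test every dot-separated suffix, so www.example.gov matches
        # example.gov but notexample.gov does not.
        if any(".".join(labels[i:]) in domains for i in range(len(labels))):
            kept.append(host)
    return kept


# Hypothetical sample data standing in for current-federal.csv and a host list.
sample_csv = "Domain name,Agency\nEXAMPLE.GOV,Hypothetical Agency\n"
sample_hosts = [
    "www.example.gov",
    "example.gov",
    "notexample.gov",
    "example.gov.evil.com",
]

domains = load_domains(sample_csv)
federal_hosts = filter_hostnames(sample_hosts, domains)
print(federal_hosts)
```

Matching on whole dot-separated suffixes, rather than a plain string suffix test, avoids accepting look-alike hosts such as `notexample.gov`.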
### GPO seeds
Seeds supplied by Dorothy Bower of the U.S. Government Publishing Office:
* FDLP_WEb_Archiveseed_list_20240212.csv - list of seeds from the FDLP Web Archive, with one-page-only seeds (mainly embedded YouTube videos) removed.
* PURL_server_domains_20240214.csv - report of all target domains from the PURL server; some determined to be out of scope were not included in the Nomination Tool.
* PURL_server_domains_20240214_non_gov_mil.csv - seeds outside .gov/.mil from the PURL_server_domains_20240214.csv list that Mark Phillips of UNT determined to be in scope.
### Internet Archive seeds
Seeds supplied by Antoine McGrath of Internet Archive:
* CRS_ReportsList.csv - URLs of government-hosted CRS Reports, nominated by Daniel Schuman of the American Governance Institute.
### Library of Congress seeds
* LOC-seeds-for-eot-20240712.xlsx
### National Archives and Records Administration seeds
Seeds supplied by Elizabeth England of the U.S. National Archives and Records Administration (NARA):
* 117th_House_Seeds.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
* 117th_Senate_Seeds.xlsx
* 118th_House_Seed_List.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
* 118th_Senate.xlsx
### Stanford seeds
Seeds supplied by James Jacobs of Stanford University Libraries:
* FOIA_Libraries_Dataset_Oct_3_2023_Final.xlsx - spreadsheet with seeds for all of the federal FOIA libraries. Lisa DeLuca, who collated the list, said it would be fine to use her spreadsheet from https://works.bepress.com/lisa_deluca/59/.
* govdoc-l-seeds-2024.txt - seeds from documents and sites recommended on the govdoc-l Listserv between 2020 and 2024.
### UC San Diego seeds
Seeds supplied by Kelly L. Smith, Government Information Librarian and Librarian for Urban Studies & Planning / Environmental Studies at UC San Diego Library (via James Jacobs):
* govspeakeot080124.xlsx - list of all the live URLs from Smith's [GovSpeak acronym and abbreviation guide](https://ucsd.libguides.com/govspeak/home).
* RoundupListsforEOT.txt
* govspeakurls1124.txt - updated list of GovSpeak links, with about 250 new items added since the August list. CDC, ED, and a couple of other agencies have also significantly reorganized their websites since then.
### Seeds sourced from Web resources
The End of Term Web Archive team drew seeds from the following sources on the Web:
* US_Digital_Registry.csv - CSV file generated on 9/11/2024 by Praneeth Rikka at UNT from the data at the [Touchpoints U.S. Digital Registry](https://touchpoints.app.cloud.gov/registry).
* Military-Departments-A-Z-List.csv - CSV file generated on 9/11/2024 by Lauren Ko at UNT from the data of the [U.S. Department of Defense's A-Z List](https://www.defense.gov/Resources/Military-Departments/A-Z-List/).
* current-federal.csv - Pulled from Cybersecurity and Infrastructure Security Agency's [dotgov-data repo](https://github.com/cisagov/dotgov-data) (via https://raw.githubusercontent.com/cisagov/dotgov-data/main/current-federal.csv on 9/12/2024).
* site-scanning-target-url-list.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/raw/main/data/site-scanning-target-url-list.csv on 9/12/2024).
* us-government-website-directory.csv - Pulled from [GSA's federal-website-directory repo](https://github.com/GSA/federal-website-directory) (via https://raw.githubusercontent.com/GSA/federal-website-directory/main/us-government-website-directory.csv on 9/12/2024). The repo README indicates "The Federal Website Directory is a comprehensive list of the public-facing websites of the U.S. Federal Government, spanning all three branches".
* dotmil_websites.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/blob/main/data/dataset/dotmil_websites.csv on 9/12/2024).
* 2_govt_urls_federal_only.csv - Pulled from [GSA's govt-urls repo](https://github.com/GSA/govt-urls/) (via https://raw.githubusercontent.com/GSA/govt-urls/main/2_govt_urls_federal_only.csv on 9/12/2024). The README indicates the repo "contains the list of public government managed domains that exist outside of the top-level .gov and .mil domains."
* usagov.csv - Seeds scraped from https://www.usa.gov/agency-index/ by Jake Abrams, Founder, CivicsUS, LLC.