mirror of
https://github.com/end-of-term/eot2024
synced 2024-11-25 15:53:41 +01:00
# End of Term 2024 Seed Lists

Posted here are seed lists used in the 2024 End of Term Web Archive project.
Provenance notes are included below. These lists will be uploaded into the
[End of Term Bulk Nomination Tool](https://digital2.library.unt.edu/nomination/eth2024_bulk/).

### Common Crawl Foundation seeds

See [commoncrawl/ccf-eot-seeds-2024](https://github.com/commoncrawl/ccf-eot-seeds-2024) for details.

* ccf-gov-federal-web-graph-2024-jun-jul-aug.txt - all hostnames in CCF's June/July/August 2024 web graph that fall under the federal .gov domains listed in current-federal.csv
* ccf-mil-web-graph-2024-jun-jul-aug.txt - all .mil hostnames in CCF's June/July/August 2024 web graph

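The filtering behind the first list can be sketched roughly as follows. This is a minimal illustration, not CCF's actual pipeline (the linked repository documents that). It assumes current-federal.csv has a column named `Domain name` with one registered domain per row and that web graph hostnames are available as plain strings; the function names and sample data are hypothetical.

```python
import csv
import io


def load_domains(csv_text, column="Domain name"):
    """Return the lowercased domains found in one column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip().lower() for row in reader if row.get(column)}


def filter_hostnames(hostnames, domains):
    """Keep hostnames that equal a listed domain or are a subdomain of one."""
    kept = []
    for host in hostnames:
        host = host.strip().lower().rstrip(".")
        labels = host.split(".")
        # Test every suffix: a.b.example.gov, b.example.gov, example.gov, gov
        if any(".".join(labels[i:]) in domains for i in range(len(labels))):
            kept.append(host)
    return kept


# Hypothetical sample data standing in for current-federal.csv and the web graph.
SAMPLE_CSV = (
    "Domain name,Domain type\n"
    "NASA.GOV,Federal - Executive\n"
    "loc.gov,Federal - Legislative\n"
)
SAMPLE_HOSTS = ["www.nasa.gov", "nasa.gov", "example.com", "notnasa.gov", "blogs.loc.gov"]

federal_domains = load_domains(SAMPLE_CSV)
federal_hosts = filter_hostnames(SAMPLE_HOSTS, federal_domains)
print(federal_hosts)
```

Matching on whole labels rather than a plain string suffix keeps a hostname such as notnasa.gov from being mistaken for a subdomain of nasa.gov.
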
### GPO seeds
Seeds supplied by Dorothy Bower of the U.S. Government Publishing Office:

* FDLP_WEb_Archiveseed_list_20240212.csv - list of seeds from the FDLP Web Archive, with one-page-only seeds deleted (these were mainly embedded YouTube videos).
* PURL_server_domains_20240214.csv - report of all target domains from the PURL server; some that were determined to be out of scope were not included in the Nomination Tool.
* PURL_server_domains_20240214_non_gov_mil.csv - non-.gov/.mil seeds from the PURL_server_domains_20240214.csv list that were determined to be in scope by Mark Phillips of UNT.

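The first step behind the non-.gov/.mil file, separating domains by top-level label, can be approximated as below. This is a minimal sketch that assumes the domains are available as plain strings (the column layout of the real CSV is not described here); the function names and sample values are hypothetical. The in-scope decision itself was a manual review and is not modeled.

```python
def is_gov_or_mil(domain):
    """True when the domain's last label is 'gov' or 'mil'."""
    host = domain.strip().lower().rstrip(".")
    return host.rsplit(".", 1)[-1] in {"gov", "mil"}


def split_by_tld(domains):
    """Partition domains into (.gov/.mil, everything else), preserving order."""
    gov_mil, other = [], []
    for domain in domains:
        (gov_mil if is_gov_or_mil(domain) else other).append(domain)
    return gov_mil, other


# Hypothetical sample values, not taken from the PURL report.
sample = ["www.gpo.gov", "www.army.mil", "www.si.edu", "example.org", "gov.example.com"]
gov_mil, other = split_by_tld(sample)
print(other)
```

Only the `other` group would then need a human scope review.
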
### Internet Archive seeds
Seeds supplied by Antoine McGrath of Internet Archive:

* CRS_ReportsList.csv - nominated URLs for government-hosted CRS Reports, from Daniel Schuman with the American Governance Institute.

### Library of Congress seeds

* LOC-seeds-for-eot-20240712.xlsx

### National Archives and Records Administration seeds

Seeds supplied by Elizabeth England of the U.S. National Archives and Records Administration (NARA):

* 117th_House_Seeds.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
* 117th_Senate_Seeds.xlsx
* 118th_House_Seed_List.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
* 118th_Senate.xlsx

### Stanford seeds
Seeds supplied by James Jacobs of Stanford University Libraries:

* FOIA_Libraries_Dataset_Oct_3_2023_Final.xlsx - spreadsheet with seeds for all of the federal FOIA libraries. Lisa DeLuca, who collated the list, said it would be fine to use her spreadsheet from https://works.bepress.com/lisa_deluca/59/.
* govdoc-l-seeds-2024.txt - seeds from documents/sites recommended on the govdoc-l Listserv, 2020-2024.

### UC San Diego
Seeds supplied by Kelly L. Smith, Government Information Librarian and Librarian for Urban Studies & Planning / Environmental Studies at UC San Diego Library (via James Jacobs):

* govspeakeot080124.xlsx - list of all the live URLs from Smith's [GovSpeak acronym and abbreviation guide](https://ucsd.libguides.com/govspeak/home).
* RoundupListsforEOT.txt
* govspeakurls1124.txt - updated list of GovSpeak links, with about 250 new items added since the August list; CDC, ED, and a couple of other agencies have also significantly reorganized their websites since then.

### Seeds sourced from Web resources
The End of Term Web Archive team compiled a list of Web sources from which to draw seeds:

* US_Digital_Registry.csv - CSV file generated on 9/11/2024 by Praneeth Rikka at UNT from the data at the [Touchpoints U.S. Digital Registry](https://touchpoints.app.cloud.gov/registry).
* Military-Departments-A-Z-List.csv - CSV file generated on 9/11/2024 by Lauren Ko at UNT from the data of the [U.S. Department of Defense's A-Z List](https://www.defense.gov/Resources/Military-Departments/A-Z-List/).
* current-federal.csv - Pulled from Cybersecurity and Infrastructure Security Agency's [dotgov-data repo](https://github.com/cisagov/dotgov-data) (via https://raw.githubusercontent.com/cisagov/dotgov-data/main/current-federal.csv on 9/12/2024).
* site-scanning-target-url-list.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/raw/main/data/site-scanning-target-url-list.csv on 9/12/2024).
* us-government-website-directory.csv - Pulled from [GSA's federal-website-directory repo](https://github.com/GSA/federal-website-directory) (via https://raw.githubusercontent.com/GSA/federal-website-directory/main/us-government-website-directory.csv on 9/12/2024). The repo README indicates "The Federal Website Directory is a comprehensive list of the public-facing websites of the U.S. Federal Government, spanning all three branches".
* dotmil_websites.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/blob/main/data/dataset/dotmil_websites.csv on 9/12/2024).
* 2_govt_urls_federal_only.csv - Pulled from [GSA's govt-urls repo](https://github.com/GSA/govt-urls/) (via https://raw.githubusercontent.com/GSA/govt-urls/main/2_govt_urls_federal_only.csv on 9/12/2024). The README indicates the repo "contains the list of public government managed domains that exist outside of the top-level .gov and .mil domains."
* usagov.csv - Seeds scraped from https://www.usa.gov/agency-index/ by Jake Abrams, Founder, CivicsUS, LLC.

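The GitHub links above come in three shapes: raw.githubusercontent.com, github.com `/raw/`, and github.com `/blob/`. A `/blob/` link points at GitHub's HTML file viewer rather than the file contents, so anyone re-fetching these files with a script may want to normalize the links first. The helper below is a minimal sketch of that rewrite; `to_raw_url` is a hypothetical name, and the code assumes the branch or tag name contains no slash.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(url):
    """Rewrite a github.com /blob/ or /raw/ file URL to raw.githubusercontent.com."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected path shape: owner/repo/(blob|raw)/ref/path/to/file
    if (
        parts.netloc != "github.com"
        or len(segments) < 5
        or segments[2] not in ("blob", "raw")
    ):
        return url  # leave anything else untouched
    owner, repo, _, ref = segments[:4]
    file_path = "/".join(segments[4:])
    raw_path = f"/{owner}/{repo}/{ref}/{file_path}"
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


print(to_raw_url(
    "https://github.com/GSA/federal-website-index/blob/main/data/dataset/dotmil_websites.csv"
))
```

URLs that are already on raw.githubusercontent.com, or that do not match the expected path shape, are returned unchanged.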