* ccf-gov-federal-web-graph-2024-jun-jul-aug.txt - all .gov federal hostnames from current-federal.csv domains in CCF's 2024 June/July/August web graph
* ccf-mil-web-graph-2024-jun-jul-aug.txt - all .mil hostnames from CCF's 2024 June/July/August web graph
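
The hostname lists above are described as web graph hostnames restricted to domains in current-federal.csv. As a minimal sketch of that kind of filter (not the script actually used to build these files), the Python below keeps a hostname when it equals a listed domain or is a subdomain of one. The column name `Domain name`, the one-hostname-per-line input in normal (non-reversed) order, and the function names are all assumptions for illustration.

```python
import csv
import io


def load_domains(csv_text, column="Domain name"):
    """Return the lowercased set of domains from a current-federal.csv-style file.

    The column name is an assumption; adjust it to match the actual CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip().lower() for row in reader if row.get(column)}


def filter_hostnames(hostnames, domains):
    """Keep hostnames that equal a listed domain or are a subdomain of one."""
    kept = []
    for host in hostnames:
        # Normalize case and drop any trailing root dot.
        host = host.strip().lower().rstrip(".")
        labels = host.split(".")
        # Test every dot-aligned suffix: a.b.nasa.gov, b.nasa.gov, nasa.gov, gov.
        # Matching on whole labels avoids false hits such as notnasa.gov.
        if any(".".join(labels[i:]) in domains for i in range(len(labels))):
            kept.append(host)
    return kept
```

Matching on whole labels rather than with a plain string suffix test is the main design choice here: it prevents a domain such as `nasa.gov` from matching an unrelated hostname that merely ends with the same characters.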
### Defenders of Wildlife seeds
Seeds submitted by Andrew Carter on behalf of Defenders of Wildlife:
Seeds supplied by Dorothy Bower of the U.S. Government Publishing Office:
* FDLP_WEb_Archiveseed_list_20240212.csv - list of seeds from the FDLP Web Archive, with one-page-only seeds removed; the removed seeds were mainly embedded YouTube videos.
* PURL_server_domains_20240214.csv - report of all target domains from the PURL server; some determined to be out of scope were not included in the Nomination Tool.
* PURL_server_domains_20240214_non_gov_mil.csv - non-.gov/.mil seeds from the PURL_server_domains_20240214.csv list that were determined to be in scope by Mark Phillips of UNT.
* BLM 2020-2024.xlsx - 2,544 entries from the Bureau of Land Management. Most, but not all, are PDFs. In addition to the usual techniques, a number of extra searches were run to find documents that include terms such as ANWR, oil, and fracking.
### National Archives and Records Administration seeds
Seeds supplied by Elizabeth England of the U.S. National Archives and Records Administration (NARA):
* 117th_House_Seeds.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
* 118th_House_Seed_List.xlsx - contains five sheets, one each for: House members, majority committees, minority committees, caucuses, and leadership/support/other.
Seeds supplied by Christie Moffatt of the National Library of Medicine:
* NLMRecommendationsEOT2024.xlsx - highest priority are federal seeds recommended for NLM's Sexual and Gender Minority Health web archive, which have been identified by NIH's Sexual and Gender Minority Research Office.
* FOIA_Libraries_Dataset_Oct_3_2023_Final.xlsx - spreadsheet with seeds for all of the federal FOIA libraries. Lisa DeLuca, who collated the list, said it would be fine to use her spreadsheet from https://works.bepress.com/lisa_deluca/59/.
Seeds supplied by Kelly L. Smith, Government Information Librarian and Librarian for Urban Studies & Planning / Environmental Studies at UC San Diego Library (via James Jacobs):
* govspeakeot080124.xlsx - list of all the live URLs from Smith's [GovSpeak acronym and abbreviation guide](https://ucsd.libguides.com/govspeak/home).
* govspeakurls1124.txt - updated list of GovSpeak links, with about 250 new items added since the August list. CDC, ED, and a couple of other agencies have also significantly reorganized their websites since then.
* eot_lgbtqandmisc.txt - 4,300+ URLs for the EOT project. Most were identified by the small group working on LGBTQ+ pages; others come from Smith's LibGuide pages, including the Roe v. Wade links, many of the stats/data sites from the Data Is Plural federal list, and weekly roundups.
### Seeds submitted to eot-info@archive.org
* Federal URLs linked to on EnergyFundsForAll.org.xlsx - Submitted by Sally Robertson, EnergyFundsForAll.org
* Performance.gov-equity-hermann-wu-20241219.xlsx - seeds submitted by Ailsa Hermann-Wu on 20241219 centered around Performance.gov -- these are all PDFs of agency equity action plans or AANHPI plans.
* Sustainability-gov-Hermann-Wu-20241220.xlsx - spreadsheet of PDF links (climate/sustainability plans and scorecards) from Sustainability.gov, excluding only the ones that are already listed in the URL Nomination Tool.
* US_Digital_Registry.csv - CSV file generated on 9/11/2024 by Praneeth Rikka at UNT from the data at the [Touchpoints U.S. Digital Registry](https://touchpoints.app.cloud.gov/registry).
* Military-Departments-A-Z-List.csv - CSV file generated on 9/11/2024 by Lauren Ko at UNT from the data of the [U.S. Department of Defense's A-Z List](https://www.defense.gov/Resources/Military-Departments/A-Z-List/).
* current-federal.csv - Pulled from Cybersecurity and Infrastructure Security Agency's [dotgov-data repo](https://github.com/cisagov/dotgov-data) (via https://raw.githubusercontent.com/cisagov/dotgov-data/main/current-federal.csv on 9/12/2024).
* site-scanning-target-url-list.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/raw/main/data/site-scanning-target-url-list.csv on 9/12/2024).
* us-government-website-directory.csv - Pulled from [GSA's federal-website-directory repo](https://github.com/GSA/federal-website-directory) (via https://raw.githubusercontent.com/GSA/federal-website-directory/main/us-government-website-directory.csv on 9/12/2024). The repo README indicates "The Federal Website Directory is a comprehensive list of the public-facing websites of the U.S. Federal Government, spanning all three branches".
* dotmil_websites.csv - Pulled from [GSA's federal-website-index repo](https://github.com/GSA/federal-website-index) (via https://github.com/GSA/federal-website-index/blob/main/data/dataset/dotmil_websites.csv on 9/12/2024).
* 2_govt_urls_federal_only.csv - Pulled from [GSA's govt-urls repo](https://github.com/GSA/govt-urls/) (via https://raw.githubusercontent.com/GSA/govt-urls/main/2_govt_urls_federal_only.csv on 9/12/2024). The README indicates the repo "contains the list of public government managed domains that exist outside of the top-level .gov and .mil domains."
* CDC html URLs from sitemap data - 20241201.csv - file of about 46,000 .html URLs created by parsing the CDC's sitemap index at https://www.cdc.gov/wcms-auto-sitemap-index.xml. That index points to other sitemaps, which in turn list the .html files.
* CDC found PDFs 20241209 cleaned single file.csv - .gov PDF links obtained from webpages found on the US CDC website. It contains 46,873 links, each recorded with three fields: the source HTML page containing the PDF link; the UTC time at which the accessibility of the PDF file was confirmed; and the URL of the PDF file itself. PDF links are deduplicated when multiple pages point to the same PDF, and link fragments are removed. Every PDF file had its accessibility and content type verified with an HTTP HEAD request on December 9, 2024.
* sitemaps.txt - List of federal website sitemap URLs discovered programmatically by Bentley Hensel via robots.txt files and common sitemap URL paths. URLs scraped from these sitemaps are organized into files named by hostname in the sitemap-url-seeds directory. This directory contains 2,749 files listing more than 56 million URLs, as well as `_report_6_FINAL.txt`, which gives statistics about the file sizes and URL counts in each of the files. URLs from the sitemap-url-seeds files will NOT be loaded into the eth2024_bulk Nomination Tool instance due to size.
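
Several of the files above were built by following sitemaps: reading `Sitemap:` lines from robots.txt, then walking from a sitemap index to its child sitemaps and on to page URLs. The Python below is a minimal sketch of that general technique using only the standard library; it is not the code the contributors ran. The function names, the `max_sitemaps` limit, and the caller-supplied `fetch` function (which takes a URL and returns XML text) are assumptions for illustration.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return sitemap URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, since the URL itself contains colons.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def parse_sitemap(xml_text):
    """Return (kind, urls), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    if root.tag == SITEMAP_NS + "sitemapindex":
        kind, child = "index", "sitemap"
    elif root.tag == SITEMAP_NS + "urlset":
        kind, child = "urlset", "url"
    else:
        raise ValueError(f"not a sitemap document: {root.tag}")
    path = f"{SITEMAP_NS}{child}/{SITEMAP_NS}loc"
    return kind, [loc.text.strip() for loc in root.iterfind(path) if loc.text]


def collect_page_urls(start_url, fetch, max_sitemaps=1000):
    """Walk nested sitemap indexes breadth-first and return the page URLs found.

    fetch is a hypothetical caller-supplied function: URL -> XML text.
    """
    seen, queue, pages = set(), deque([start_url]), []
    while queue and len(seen) < max_sitemaps:
        url = queue.popleft()
        if url in seen:
            continue  # guard against indexes that reference each other
        seen.add(url)
        kind, urls = parse_sitemap(fetch(url))
        if kind == "index":
            queue.extend(urls)
        else:
            pages.extend(urls)
    return pages
```

Keeping the network access behind `fetch` means the parsing logic can be exercised against saved sitemap files, and a real crawl would still need timeouts, error handling, and support for gzipped sitemaps, none of which are shown here.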