hyperreal/eot2024

mirror of https://github.com/end-of-term/eot2024 synced 2025-04-26 17:03:26 -05:00

Author	SHA1	Message	Date
YakShaver	5a83e824d4	cdc datasets urls from sitemap (#26) CDC dataset URLs derived from this sitemap file: https://s3.amazonaws.com/sa-socrata-sitemaps-us-east-1-fedramp-prod/sitemaps/sitemap-datasets-data.cdc.gov0.xml. The format of this CSV file is: sitemap file the URL is sourced from,the URL. Note that the pages pointed to by these URLs usually include a download button to get a CSV of the dataset, but the CSVs themselves aren't included in this seed file and won't be retrieved in the crawl. But at least the existence of the dataset and links to its metadata will be documented in the archive.	2025-01-06 10:05:19 -06:00
Lauren Ko	ee6cba5868	Update README for FDA and NIH sitemap seed lists	2025-01-02 12:32:17 -06:00
YakShaver	a3d96841db	Adds FDA and NIH HTML URLs for seed list (#25) * FDA HTML urls for seed list This is a file of URLs derived from the FDA's sitemap.xml file. It has URLs of the form /media//download and the warning-letters filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately. The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL> FDA warning letters from sitemap.xml This is a file of URLs derived from the FDA's sitemap.xml file with warning letters content. I thought this was going to be PDF content, but it turns out to be HTML content. The format of this CSV file is: sitemap file the URL is sourced from,the URL * FDA download links from sitemaps This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's pdfjs library, so the download in the path name is a little misleading. The format of this CSV file is: sitemap file the URL is sourced from,the URL * NIH urls from three sitemaps. This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml. The format of this CSV file is: sitemap file the URL is sourced from,the URL	2025-01-02 11:33:40 -06:00
Lauren Ko	526cb84dd0	Add State Department FOIA PDF seed lists to README	2025-01-02 11:26:40 -06:00
Antoine McGrath	9c4b1910fd	Add files via upload (#24) Added State Department FOIA URLs The 458,130 URLs included in the files USAStateFOIA_pdf_urls_part1.txt and USAStateFOIA_pdf_urls_part2.txt are derived from the US State Department's Freedom of Information Act (FOIA) Virtual Reading Room search database. The website serves thousands of Collections as Search Pages (2,940) and 455,190 endpoint PDFs. 2,940 Search Pages Examples: http://foia.state.gov/Search/Results.aspx?collection=Clinton_Email_February_29_Release http://foia.state.gov/Search/Results.aspx?collection=Litigation_F-2016-07895_6 https://foia.state.gov/Search/Results.aspx?caseNumber=F-1991-05139 https://foia.state.gov/Search/Results.aspx?IRIA.aspx https://foia.state.gov/Search/Results.aspx?Microfiche.aspx 455,190 PDF URLs Examples: https://foia.state.gov/DOCUMENTS/1-FY2012/F-2004-02207/DOC_0C17731327/C17731327.pdf https://foia.state.gov/DOCUMENTS/FOIA_Micro_Aug2024_6/F-1986-01832/DOC_0C09000001/C09000001.pdf https://foia.state.gov/DOCUMENTS/FOIA_Micro_Oct2024_7/F-1989-00718/DOC_0C09000006/C09000006.pdf https://foia.state.gov/DOCUMENTS/Litigation/HRCLitigation_1/JW7 RD4 02-24-2014 - 1 of 7.sdhdhpdf_Part1.pdf https://foia.state.gov/DOCUMENTS\Argentina\0000AFA2.pdf	2025-01-02 11:22:58 -06:00
Lauren Ko	3e95bc46a9	Tweak seed lists README	2025-01-02 11:02:08 -06:00
James R. Jacobs	59b04f37a8	Add 6 bulk lists (#23) * Update README.md 2 new lists from EDGI and Hermann-Wu submitted via dot-info@archive.org * Add files via upload 2 new lists from EDGI and Hermann-Wu submitted via dot-info@archive.org * Update README.md bulk list re USDA seeds submitted by AWI 20241222 * Add files via upload bulk list re USDA seeds submitted by AWI 20241222 * Update README.md 3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx, AWI-USDA-FSIS-20241222.xlsx, NSF-20241224.xlsx * Add files via upload 3 bulk lists added: 1 from Gary Price and 2 from eot-info submissions: AWI-XL-4-20241224.xlsx, AWI-USDA-FSIS-20241222.xlsx, NSF-20241224.xlsx * Update README.md list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx * Add files via upload list submitted by Natl Indian Law Library Natl-Indian-Law-Library-bulk-seeds-20241224.xlsx	2025-01-02 10:56:39 -06:00
James R. Jacobs	06bfdd7bcd	Add files via upload Sustainability-gov-Hermann-Wu-20241220.xlsx Signed-off-by: Lauren Ko <lauren.ko@unt.edu>	2024-12-20 09:24:08 -06:00
James R. Jacobs	bf7cf89659	Update README.md added bulk list Sustainability-gov-Hermann-Wu-20241220.xlsx	2024-12-20 09:19:42 -06:00
Lauren Ko	86a3364700	Add Defenders of Wildlife seeds	2024-12-19 15:30:47 -06:00
James R. Jacobs	0f565c94e4	bulk list submitted 20241219 by Ailsa Hermann-Wu (#21) * Update README.md bulk list on performance.gov by Ailsa Hermann-Wu * Add files via upload Bulk list submitted by Ailsa Hermann-Wu re performance.gov 20241219	2024-12-19 15:00:38 -06:00
James R. Jacobs	f1694a635c	bulk seed list of GAO seeds by Ailsa Hermann-Wu (#20) * Update README.md bulk seed list of GAO seeds sent by Ailsa Hermann-Wu * Add files via upload bulk seed list of GAO seeds sent by Ailsa Hermann-Wu	2024-12-18 10:14:24 -06:00
James R. Jacobs	ef3bd7d5f9	3 new bulk lists submitted by Gary Price (#19) * Update README.md added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx * Add files via upload USDA_FIS_ERS.xlsx from Gary Price/infodocket. 1700 or so urls from the USDA. Specifically, the Food Inspection Service and Economic Research Service. * Update README.md 3 more bulk lists from Gary Price sent on 12/14/2024 * Add files via upload 3 new bulk lists from Gary Price submitted 12/14/2024	2024-12-16 14:11:03 -06:00
James R. Jacobs	ed7cabab8e	another bulk seed list from Gary Price (USDA) (#18) * Update README.md added another bulk list from Gary Price/Infodocket. File is USDA_FIS_ERS.xlsx * Add files via upload USDA_FIS_ERS.xlsx from Gary Price/infodocket. 1700 or so urls from the USDA. Specifically, the Food Inspection Service and Economic Research Service.	2024-12-12 15:38:05 -06:00
Lauren Ko	97a727fc4e	Update README for sitemaps.txt and sitemap-url-seeds directory	2024-12-11 13:21:44 -06:00
Lauren Ko	94e610e8e1	Merge pull request #14 from TheBoatyMcBoatFace/main * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Batch commit of sitemap URL seeds under 500MB or 250 files * Forgot to add sitemaps.txt Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com> --------- Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>	2024-12-11 12:58:16 -06:00
Bentley Hensel	bad24fe745	Forgot to add sitemaps.txt Signed-off-by: Bentley Hensel <bentleyhensel@gmail.com>	2024-12-10 20:36:47 -05:00
Bentley Hensel	e2054da35c	Merge pull request #3 from end-of-term/main update	2024-12-10 20:33:49 -05:00
Lauren Ko	3b3bf304b9	Update README for CDC PDFs	2024-12-10 14:55:01 -06:00
YakShavingAsAService	f4b194553a	PDFs from the CDC website - single file (#17) This is a CSV file of PDF links obtained from webpages found on the US CDC website. It contains 46,873 links, with the format: the source HTML file containing the PDF link; the time in UTC at which the accessibility of the PDF file was confirmed; and a URL pointing to the PDF file itself. This file replaces the two previous files. This file has had the PDF links deduped, so if multiple pages point to the same PDF, you'll only see an entry for the first reference. PDF links that point to non-gov domains have been omitted as well. If the PDF link contains a fragment, the fragment will be removed from the path (e.g. "/a/path/mypdf.pdf#page=3" will get turned into "/a/path/mypdf.pdf"). All the PDF files have had their accessibility and content type verified with an HTTP HEAD request on Dec. 09 2024.	2024-12-10 14:51:36 -06:00
James R. Jacobs	5a9195431e	bulk list of NPS seeds submitted by Hermann-Wu - Hermann-Wu-nps-20241209.txt (#16) * Update README.md added NPS seeds submitted by Hermann-Wu - Hermann-Wu-nps-20241209.txt * Add files via upload NPS seeds submitted by Hermann-Wu - Hermann-Wu-nps-20241209.txt * Update README.md edited the contact section.	2024-12-10 12:48:14 -06:00
Bentley Hensel	4bef8b223d	Merge pull request #2 from end-of-term/main Add some Bureau of Land Management and EnergyFundsForAll.org seeds (#15)	2024-12-09 15:59:46 -05:00
Lauren Ko	ed4d0f0d8a	Add some Bureau of Land Management and EnergyFundsForAll.org seeds (#15) * Update README.md * Update README.md * Add files via upload * Update README.md added bulk file from EnergyFundsForAll.org * Bulk list from EnergyFundsForAll * Remove extra whitespace Signed-off-by: Lauren Ko <lauren.ko@unt.edu> * Remove duplicate listing of infodocket-11-21-2024.xls --------- Signed-off-by: Lauren Ko <lauren.ko@unt.edu> Co-authored-by: James R. Jacobs <freegovinfo@gmail.com>	2024-12-09 14:28:43 -06:00
Bentley Hensel	7a74ece080	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:41:07 -05:00
Bentley Hensel	bd3fdbde47	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:38:34 -05:00
Bentley Hensel	49aee9c7bc	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:37:48 -05:00
Bentley Hensel	bf267e339e	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:37:09 -05:00
Bentley Hensel	c015b8b98d	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:35:50 -05:00
Bentley Hensel	980fa37e2a	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:33:46 -05:00
Bentley Hensel	4042707213	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:33:25 -05:00
Bentley Hensel	0535ad7cf2	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:32:01 -05:00
Bentley Hensel	73719faa91	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:29:05 -05:00
Bentley Hensel	4d70936a23	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 18:19:10 -05:00
Bentley Hensel	aeea7beac2	Batch commit of sitemap URL seeds under 500MB or 250 files	2024-12-05 17:56:32 -05:00
Lauren Ko	a6e38c7311	Add CDC .html seed list	2024-12-03 15:53:17 -06:00
Lauren Ko	47e8f8eb67	Add Bluesky URL Co-authored-by: Melody Joy Kramer <melodykramer@gmail.com>	2024-12-03 12:34:32 -06:00
James R. Jacobs	d633f6965c	uploaded new bulk seed files from Gary Price and Kelly Smith (#11) * adding info docket bulk seed list * Update README.md * Update README.md * Add files via upload Bulk lists from Gary Price and Kelly Smith. Seed list readme updated with file names.	2024-12-02 12:11:12 -06:00
Lauren Ko	4519cb1ee8	Add NLM seed list	2024-11-22 13:18:59 -06:00
Lauren Ko	37b32203c5	Add irs.gov seeds from Gary Price	2024-11-21 13:24:36 -06:00
James R. Jacobs	58e14710e3	pull requests for info docket bulk list 11-21-2024 (#5) * adding info docket bulk seed list * Update README.md	2024-11-21 12:56:50 -06:00
Lauren Ko	7e3d04ed8c	Update README for bsky_gov_urlverified.txt	2024-11-21 08:54:52 -06:00
Antoine McGrath	01662e4c87	Create bsky_gov_urlverified.txt (#4) URLs for official US Senate.gov and House.gov Bluesky accounts	2024-11-21 08:49:00 -06:00
Lauren Ko	3a14a8fb3f	Add seed list from EDGI	2024-11-14 09:54:06 -06:00
Lauren Ko	8e8c22e358	Add updated govspeak list	2024-11-08 09:36:35 -06:00
Lauren Ko	a325cf3f79	Add two lists supplied by James Jacobs	2024-10-25 16:48:38 -05:00
Lauren Ko	99460625a9	Add usagov.csv seed list	2024-09-23 11:10:35 -05:00
Greg Lindahl	ba124bec62	Common Crawl seeds (#3) * Common Crawl Foundation seeds * clean mil list to just hostnames * doc: add location of ccf repo that generated these files --------- Co-authored-by: Greg Lindahl <greg@commomncrawl.org>	2024-09-16 09:33:58 -05:00
Lauren Ko	4392d90188	Add more files from web resources	2024-09-12 15:54:23 -05:00
Lauren Ko	b9dfb4f189	Add seeds from https://touchpoints.app.cloud.gov/registry	2024-09-12 12:19:09 -05:00
Lauren Ko	1b1b4736b4	Add NARA's 118th House Seeds	2024-09-09 16:24:56 -05:00
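
Commit f4b194553a describes three cleanup steps applied to the CDC PDF links: strip any URL fragment, omit links to non-gov domains, and keep only the first reference to each PDF. The following is a minimal Python sketch of those steps, not the contributor's actual script. The function name normalize_pdf_links and the (source page, PDF URL) row shape are assumptions made for illustration, and the HTTP HEAD verification mentioned in the commit is left out so the example runs without network access.

```python
from urllib.parse import urldefrag, urlsplit


def normalize_pdf_links(rows):
    """Clean (source_page, pdf_url) pairs as described in commit f4b194553a.

    Hypothetical helper: the name and the row shape are assumptions,
    not taken from the repository.
    """
    seen = set()
    kept = []
    for source_page, pdf_url in rows:
        # Remove the fragment: ".../mypdf.pdf#page=3" -> ".../mypdf.pdf"
        url, _fragment = urldefrag(pdf_url)
        # Omit links whose host is not under the .gov top-level domain
        host = (urlsplit(url).hostname or "").lower()
        if not host.endswith(".gov"):
            continue
        # Dedupe: keep only the first page that references each PDF
        if url in seen:
            continue
        seen.add(url)
        kept.append((source_page, url))
    return kept


rows = [
    ("https://www.cdc.gov/a.html", "https://www.cdc.gov/a/path/mypdf.pdf#page=3"),
    ("https://www.cdc.gov/b.html", "https://www.cdc.gov/a/path/mypdf.pdf"),
    ("https://www.cdc.gov/c.html", "https://example.com/other.pdf"),
]
result = normalize_pdf_links(rows)
```

With the sample rows above, only the first reference to mypdf.pdf survives, with its fragment removed, and the example.com link is dropped.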

61 Commits