# wiki.bash-hackers.org

| :bulb: See [start.md](start.md) to get straight to business |
| --- |

The popular wiki.bash-hackers.org had its DNS expire in April 2023, with the owner seemingly incommunicado. It looked like the domain would be in the region of €1k to purchase - ouch.

Fortunately, archive.org has snapshotted this website, so we can extract wiki.bash-hackers.org from archive.org's Wayback Machine.

Additionally, the web server behind wiki.bash-hackers.org is still running, for now, so we can use an entry in our `hosts` file (`/etc/hosts` on *nix, `c:\Windows\System32\Drivers\etc\hosts` on Windows) that reads:

```bash
83.243.40.67 wiki.bash-hackers.org
```

This repo targets pages captured by the Wayback Machine - specifically those with `?do=edit` on the end of their URL. These pages give us the Dokuwiki markup source relatively unmolested - maybe with a bit of errant HTML to strip. We then convert the original source to GitHub markdown.

See the incomplete script "archive_crawler" to see my working. I would not recommend blindly running it - it's beta quality at best. Just read it and this page to follow the logic... or just fork this repo... or whatever, I'm not your Dad.

- TODO: Markdown linting and transformations
- TODO: Perhaps add a "This was downloaded from [wayback url here] on [date]" note to each page...
- TODO: Import page-edit history as git log entries? `?do=revisions` is the secret sauce there...

## Getting the latest capture URL from archive.org

archive.org does not appear to have an exhaustive API, but it does have at least one API call that is relevant to our needs. It's called `available`. You run a `curl -X GET` against `https://archive.org/wayback/available?url=[YOUR URL HERE]`, and it returns a bit of JSON. For example:

```bash
curl -s -X GET "https://archive.org/wayback/available?url=https://wiki.bash-hackers.org/howto/mutex?do=edit" | jq -r '.'
{
  "url": "https://wiki.bash-hackers.org/howto/mutex?do=edit",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20220615023742/https://wiki.bash-hackers.org/howto/mutex?do=edit",
      "timestamp": "20220615023742"
    }
  }
}
```

The path `'.archived_snapshots.closest.url'` will therefore either hold the URL of the latest capture, or it will be `null`.

Because different pages were captured at different times, you can't just take one known-good timestamp and swap other page paths into it, like this:

```bash
curl -s -X GET "http://web.archive.org/web/20220615023742/https://wiki.bash-hackers.org/some/other/page?do=edit"
```

The timestamp in the URL may or may not match a capture of that page. So we use this API call to validate that a desired resource exists and, if so, to locate the latest available copy of it. (A small wrapper around this call is sketched at the end of this page.)

## Extracting the Dokuwiki Markup

So the pages that have `?do=edit` on the end of their URL appear to have a reliable and predictable structure:

```bash
[ LINES ABOVE REMOVED FOR BREVITY ]
```
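
The sample above is truncated, but the part of the edit page we care about is an HTML `<textarea>` element holding the raw Dokuwiki source. As a minimal sketch - assuming a stock Dokuwiki edit form where the source sits inside a single `<textarea>`, which the captures may or may not match exactly - pulling that out could look something like:

```bash
#!/usr/bin/env bash
# Sketch only: assumes the Dokuwiki source is wrapped in a single
# <textarea>...</textarea> element, as on a stock Dokuwiki edit page.
extract_wikitext() {
  # Print the lines between the opening and closing textarea tags,
  # then strip the tag markup from the first and last of those lines.
  # Note: any HTML entities in the source would still need decoding.
  sed -n '/<textarea/,/<\/textarea>/p' "${1:?Usage: extract_wikitext <file>}" \
    | sed -e 's/^.*<textarea[^>]*>//' -e 's|</textarea>.*$||'
}

# Example: save a capture, then extract its source
# curl -s "${capture_url}" > page.html
# extract_wikitext page.html > page.dokuwiki
```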
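
And, to close the loop on the `available` API call described earlier: a minimal `bash` wrapper might look like the below. The function name `latest_capture` is purely illustrative; the endpoint and the `.archived_snapshots.closest.url` path are exactly as shown above.

```bash
#!/usr/bin/env bash
# Sketch only: print the latest Wayback capture URL for a given page,
# or return 1 if archive.org has no capture of it.
latest_capture() {
  local target="${1:?Usage: latest_capture <url>}" capture
  capture=$(
    curl -s -X GET "https://archive.org/wayback/available?url=${target}" |
      jq -r '.archived_snapshots.closest.url'
  )
  # jq -r prints the literal string "null" when the path isn't present
  if [[ -z "${capture}" || "${capture}" == "null" ]]; then
    return 1
  fi
  printf -- '%s\n' "${capture}"
}

# Example:
# latest_capture "https://wiki.bash-hackers.org/howto/mutex?do=edit"
```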