# wiki.bash-hackers.org Extraction of wiki.bash-hackers.org from the Wayback Machine This is targeting pages that have been captured by the Wayback Machine that specifically have `'?do=edit'` on the end of their URL. This gives us the Dokuwiki Markup source. See the incomplete script "archive_crawler" to see my working. - TODO: Parse the already downloaded files for any missing links - TODO: Markdown conversion from Dokuwiki Markup to GitHub Markdown using pandoc - TODO: Markdown linting - TODO: Rinse and repeat ## Getting the latest capture URL from archive.org archive.org does not appear to have an exhaustive API, but it does have at least one API call that is relevant to our needs. It's called `available`. How it works is you run a `curl -X GET` against `https://archive.org/wayback/available?url=[YOUR URL HERE]` and it returns to you with a bit of JSON. For example: ```bash curl -s -X GET "https://archive.org/wayback/available?url=https://wiki.bash-hackers.org/howto/mutex?do=edit" | jq -r '.' { "url": "https://wiki.bash-hackers.org/howto/mutex?do=edit", "archived_snapshots": { "closest": { "status": "200", "available": true, "url": "http://web.archive.org/web/20220615023742/https://wiki.bash-hackers.org/howto/mutex?do=edit", "timestamp": "20220615023742" } } } ``` The path `'.archived_snapshots.closest.url'` will, therefore, either have a URL of a latest capture, or it will return `null`. Because different pages have different timestamps in their capture, you can't just do something like: ```bash curl -s -X GET "http://web.archive.org/web/20220615023742/https://wiki.bash-hackers.org/some/other/page?do=edit" ``` Because the timestamp in the URL may or may not match. So we use this API call to locate genuine resources. ## Extracting the Dokuwiki Markup So the pages that have `'?do-edit'` on the end of their URL appear to have a reliable and predictable structure: ```bash [ LINES ABOVE REMOVED FOR BREVITY ]

[REST OF LINE REMOVED FOR BREVITY] [ TARGET DOKUWIKI MARKUP CODE EXISTS HERE]

[ LINES BELOW REMOVED FOR BREVITY ] ``` So basically, we remove everything from the first line to the line that contains `name="sectok"`, and then we remove everything after ``, and what's left should be the Dokuwiki Markup that we want. ## LICENSE As per the original wiki.bash-hackers.org: > Except where otherwise noted, content on this wiki is licensed under the following license: > [GNU Free Documentation License 1.3](https://web.archive.org/web/20220930131429/http://www.gnu.org/licenses/fdl-1.3.html) ## COPYRIGHT The original copyright belongs to Jan Schampera (TheBonsai) and subsequent contributors, 2007 - 2023. It's extremely important to me that copyright and attribution are given where required - the original contributors are worth their dues, and IIRC I'm one of them. If you're one of the original contributors and you believe I've violated your copyright in anyway, please let me know in the first instance.