Go to file
Rawiri Blundell abdba3a791
Update README.md
Add update about flokoe's fork/work
2023-07-15 22:21:21 +12:00
commands Minor review tweaks in response to #4 2023-04-27 08:40:50 +12:00
dict Remove .markup files 2023-04-17 22:49:22 +12:00
howto Correct internal links across all .md files 2023-04-24 23:31:29 +12:00
internals Correct internal links across all .md files 2023-04-24 23:31:29 +12:00
misc Correct html quote char 2023-04-24 23:27:29 +12:00
scripting Add badly-translated obsolete.md 2023-04-25 23:26:01 +12:00
snipplets Correct internal links across all .md files 2023-04-24 23:31:29 +12:00
syntax Correct links in conditional_expression.md 2023-04-28 09:04:13 +12:00
user/thebonsai Remove .markup files 2023-04-17 22:49:22 +12:00
archive_crawler Correct remaining html-bound links 2023-04-28 11:25:39 +12:00
bash4.md Correct internal links across all .md files 2023-04-24 23:31:29 +12:00
LICENSE First crawl of archive.org 2023-04-14 23:44:41 +12:00
need_love.md Add need_love.md 2023-04-25 23:15:19 +12:00
README.md Update README.md 2023-07-15 22:21:21 +12:00
start.md Correct remaining html-bound links 2023-04-28 11:25:39 +12:00
waybacktargets Add conditional_expression to waybacktargets 2023-04-27 09:01:38 +12:00
wishes.md Correct internal links across all .md files 2023-04-24 23:31:29 +12:00

wiki.bash-hackers.org

Update 2023/07/15:

Check out the following forked resources, which takes the next step and presents this information extraction as GH pages:

💡 See start.md to get straight to business

Update:

The web server behind wiki.bash-hackers.org is still running, for now, so we can use an entry in our hosts file (/etc/hosts on *nix, c:\Windows\System32\Drivers\etc\hosts on Windows) that reads:

83.243.40.67 wiki.bash-hackers.org

There is also a pastebin.bash-hackers.org at the same IP address.

In addition, it looks like the owner is trying to sort it out! Stay tuned!


The popular wiki.bash-hackers.org site had its DNS expire in April 2023, with the owner seemingly being incommunicado. It looked like this domain would be in the region of €1k to purchase - ouch.

Fortunately, archive.org has snapshotted this website, and so we can extract wiki.bash-hackers.org from archive.org's Wayback Machine.

This repo is targeting pages that have been captured by the Wayback Machine that specifically have ?do=edit on the end of their URL. These pages give us the Dokuwiki Markup source, relatively unmolested - maybe with a bit of errant html to strip. We then convert the original source to GitHub markdown. From there, we can front-end the content however we like, or import it to another wiki instance, or anything really.

See the incomplete script "archive_crawler" to see my working. I would not recommend blindly running it - it's beta quality at best. Just read it and this page to follow the logic... or just fork this repo... or whatever, I'm not your Dad.

  • TODO: Markdown linting and transformations
  • TODO: Perhaps add a "This was downloaded from [wayback url here] on [date]" to each page...
  • TODO: Import page-edit history as git log entries? ?do=revisions is the secret sauce there...

Getting the latest capture URL from archive.org

archive.org does not appear to have an exhaustive API, but it does have at least one API call that is relevant to our needs. It's called available.

How it works is you run a curl -X GET against https://archive.org/wayback/available?url=[YOUR URL HERE] and it returns to you with a bit of JSON. For example:

curl -s -X GET "https://archive.org/wayback/available?url=https://wiki.bash-hackers.org/howto/mutex?do=edit" | jq -r '.'
{
  "url": "https://wiki.bash-hackers.org/howto/mutex?do=edit",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20220615023742/https://wiki.bash-hackers.org/howto/mutex?do=edit",
      "timestamp": "20220615023742"
    }
  }
}

The path '.archived_snapshots.closest.url' will, therefore, either have a URL of a latest capture, or it will return null.

Because different pages have different timestamps in their capture, you can't just do something like:

curl -s -X GET "http://web.archive.org/web/20220615023742/https://wiki.bash-hackers.org/some/other/page?do=edit"

Because, obviously, the timestamp in the URL may or may not match. So we use this API call to validate that a desired resource exists, and if so, locate the latest available copy of it.

Extracting the Dokuwiki Markup

So the pages that have '?do-edit' on the end of their URL appear to have a reliable and predictable structure:

[ LINES ABOVE REMOVED FOR BREVITY ]
    <div class="toolbar group">
        <div id="draft__status" class="draft__status"></div>
        <div id="tool__bar" class="tool__bar"></div>
    </div>
    <form id="dw__editform" method="post" action="" accept-charset="utf-8" class=" form-inline"><div class="no">
<input type="hidden" name="sectok" value=""/><input type="hidden" name="id" value="wishes"/>[REST OF LINE REMOVED FOR BREVITY]

[ TARGET DOKUWIKI MARKUP CODE EXISTS HERE]

</textarea>
<div id="wiki__editbar" class="editBar">
<div id="size__ctl">
</div>
[ LINES BELOW REMOVED FOR BREVITY ]

So basically, we remove everything from the first line to the line that contains name="sectok", and then we remove everything after </textarea>, and what's left should be the Dokuwiki Markup that we want.

Converting to Markdown

We'll convert to Github Markdown using pandoc

Custom edits that pandoc doesn't handle will have to be figured out over time.

Note panels

The original dokuwiki source has several of the following entries:

\<note tip\>The syntax is somewhat confusing in that you would think
that the arrow would point in the direction of the copy, but it's
reversed. So it's `target>&source` effectively.\</note\>

We convert these to markdown tables with emojis .e.g

| :bulb: The syntax is somewhat confusing in that you would think
that the arrow would point in the direction of the copy, but it's
reversed. So it's `target>&source` effectively. |
| --- |

LICENSE

As per the original wiki.bash-hackers.org:

Except where otherwise noted, content on this wiki is licensed under the following license:

GNU Free Documentation License 1.3

The original copyright belongs to Jan Schampera (TheBonsai) and subsequent contributors, 2007 - 2023.

It's extremely important to me that copyright and attribution are given where required - the original contributors are worth their dues, and IIRC I'm one of them.

If you're one of the original contributors and you believe I've violated your copyright in anyway, please let me know in the first instance.