From e71db0fedc12cfd35fe344c5b4d9ba096bff9a5a Mon Sep 17 00:00:00 2001
From: Rawiri Blundell
Date: Sat, 15 Apr 2023 23:23:07 +1200
Subject: [PATCH] Update README.md

---
 README.md | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index a6aa23e..7126380 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,44 @@
 # wiki.bash-hackers.org
 Extraction of wiki.bash-hackers.org from the Wayback Machine
 
-This is targeting pages that have been captured by the Wayback Machine that specifically have '?do=edit' on the end of their URL. This gives us the markdown source.
+This is targeting pages that have been captured by the Wayback Machine that specifically have `'?do=edit'` on the end of their URL. This gives us the markdown source.
 
 See the incomplete script "archive_crawler" to see my working.
 
-- TODO: Second crawl
-- TODO: Filter out all the non-markdown garbage. It looks like everything up to ``, and everything after `` is a good first cull.
+- TODO: Markdown linting
+- TODO: Parse the already downloaded files for any missing links
 
-# LICENSE
+## Extracting the markdown
+So the pages that have `'?do=edit'` on the end of their URL appear to have a reliable and predictable structure:
+
+```bash
+[ LINES ABOVE REMOVED FOR BREVITY ]
+
+
+
+
+
+[REST OF LINE REMOVED FOR BREVITY]
+
+[ TARGET MARKDOWN CODE EXISTS HERE]
+
+
+
+
+
+[ LINES BELOW REMOVED FOR BREVITY ]
+```
+
+So basically, we remove everything from the first line to the line that contains `name="sectok"`, and then we remove everything after ``, and what's left should be the markdown that we want.
+
+## LICENSE
 
 As per the original wiki.bash-hackers.org:
 
 > Except where otherwise noted, content on this wiki is licensed under the following license:
 > [GNU Free Documentation License 1.3](https://web.archive.org/web/20220930131429/http://www.gnu.org/licenses/fdl-1.3.html)
 
-# COPYRIGHT
+## COPYRIGHT
 
 The original copyright belongs to Jan Schampera (TheBonsai) and subsequent contributors, 2007 - 2023.
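
The two-step cull described under "Extracting the markdown" could be sketched as a small bash function. This is only a rough sketch under stated assumptions, not the contents of the `archive_crawler` script: the function name `extract_markdown` is hypothetical, the start marker `name="sectok"` is taken from the README text, and the end marker is passed in as an argument rather than hard-coded.

```bash
#!/usr/bin/env bash

# extract_markdown (hypothetical helper): reads a saved '?do=edit' page on
# stdin and prints only the lines between the two markers.
#   $1 - end marker (literal string); everything from it onward is dropped.
# Note: awk -v interprets backslash escapes, so this sketch assumes the
# marker contains no backslashes.
extract_markdown() {
  local end_marker="${1:?usage: extract_markdown END_MARKER < page.html}"

  awk -v start='name="sectok"' -v end="${end_marker}" '
    # Step 1: discard every line up to and including the start marker line.
    !found_start {
      if (index($0, start)) found_start = 1
      next
    }
    # Step 2: stop at the end marker, keeping any text that precedes it
    # on the same line; otherwise print the line unchanged.
    {
      pos = index($0, end)
      if (pos > 0) {
        if (pos > 1) print substr($0, 1, pos - 1)
        exit
      }
      print
    }
  '
}
```

Usage would look something like `extract_markdown 'END_MARKER' < saved_page.html > page.md`, where `END_MARKER` stands in for whatever literal string closes the markdown region. Both markers are matched as plain strings with `index()` rather than as regular expressions, so characters such as `?` or `/` in a marker need no escaping.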