Update README with comments about non-markdown filtering

Rawiri Blundell 2023-04-15 00:02:52 +12:00
parent 50e7ca2385
commit cae5b29740


@@ -5,8 +5,8 @@ This is targeting pages that have been captured by the Wayback Machine that spec
 See the incomplete script "archive_crawler" to see my working.
 - TODO: Second crawl
-- TODO: Filter out all the non-markdown garbage.
+- TODO: Filter out all the non-markdown garbage. It looks like everything up to `<div class="editBox" role="application">`, and everything after `</div><!-- /content --></div>` is a good first cull.
 # LICENSE
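
The first cull described in the new TODO could be sketched roughly as below. This is only an illustration, not code from the repository: the function name `cull_page` is hypothetical, Python is an assumed language choice, and it assumes the two marker strings appear verbatim in each captured page and that the markers themselves should be discarded along with the surrounding HTML.

```python
from typing import Optional

# Marker strings quoted from the README TODO.
START_MARKER = '<div class="editBox" role="application">'
END_MARKER = '</div><!-- /content --></div>'


def cull_page(html: str) -> Optional[str]:
    """Return the text between the two markers, or None if either is missing.

    Hypothetical helper: drops everything up to and including START_MARKER,
    and everything from the first following END_MARKER onwards.
    """
    start = html.find(START_MARKER)
    if start == -1:
        return None  # page does not look like an edit-box capture
    start += len(START_MARKER)

    end = html.find(END_MARKER, start)
    if end == -1:
        return None  # closing marker missing; leave the page for manual review
    return html[start:end]
```

Returning `None` instead of the raw page keeps captures that do not match the expected layout out of the filtered output, so they can be inspected separately. Whatever remains between the markers may still contain HTML entities or other residue, which is why the TODO calls this only a first cull.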