{"id":25591,"date":"2022-12-11T16:20:11","date_gmt":"2022-12-11T16:20:11","guid":{"rendered":"https:\/\/jurn.link\/jurnsearch\/?p=25591"},"modified":"2024-09-15T16:58:46","modified_gmt":"2024-09-15T16:58:46","slug":"hot-to-archive-a-recalcitrant-forum","status":"publish","type":"post","link":"https:\/\/jurn.link\/jurnsearch\/index.php\/2022\/12\/11\/hot-to-archive-a-recalcitrant-forum\/","title":{"rendered":"How to archive a recalcitrant forum"},"content":{"rendered":"<p>Task: To download and safely archive a useful but very recalcitrant user-forum, one that may be at risk of going offline.<\/p>\n<p>Roadblocks:<\/p>\n<p>1) The forum archives can only be accessed via drop-downs that require you to input precise from-to dates (see above). Harvesters \/ bots cannot get past such barriers, and so cannot reach the forum\u2019s \u2018deep history\u2019 of per-post threads.<\/p>\n<p>2) Even if you had the individual URL of each and every forum thread, only a proper Web browser can fetch and archive each one. Automated harvesters \/ bots \/ capture utilities are quickly blocked by the forum\u2019s server.<\/p>\n<p>3) AutoIt or the newer AutoHotkey might be a solution on Windows, by calling Internet Explorer to load the URLs and then save each as a file. But my intensive searches found only arcane code fragments and a single code function; nothing complete or even part-way complete.<\/p>\n<p>The following solution thus requires a bit of manual work, though not too much. It suits a relatively small forum or sub-forum of technical coding advice (in this case, Python for 3D software) without a great weight of posted images. In this case there are 16 master pages of links to some 500 actual forum posts, and each post has user replies appended. Each post displays as a single scrolling page and is not paginated.<\/p>\n<p>Solution:<\/p>\n<p>1. Find the earliest forum thread date, then manually work through the archive and create a per-year page that shows the links to the forum threads. 
Save it, and also any continuation pages there may be for that year. Work through the years and, on a long-standing forum or sub-forum, you may end up with some 15-20 saved HTML pages. It should not take more than a few minutes.<\/p>\n<p>2. Extract a big list of all the links in these locally saved HTML pages. I used Sobolsoft\u2019s \u2018Extract Links from Multiple HTML Pages\u2019 Windows utility to do this, but there are other bulk link extractors.<\/p>\n<p>3. Save the extracted one-per-line links list to a .TXT file, copy-paste that list into Excel and sort it A-Z. From this sorted list, extract just the links that point to the forum threads. They should have a uniform path and pattern, allowing them to be easily identified and extracted. Save the new list to a further .TXT file.<\/p>\n<p>4. Use the free Chrome-based Web browser extension DownThemAll! to load the new list .TXT (Web browser | start DownThemAll! | right-click anywhere | \u2018Import from file\u2019). You may also want to set DownThemAll! to download only one forum thread at a time (Web browser | start DownThemAll! | Cog icon in DownThemAll!\u2019s lower right | Network | Concurrent downloads: 1).<\/p>\n<p>Have DownThemAll! do the downloads. Regrettably, there is no way to have DownThemAll! save the pages from the browser as .MHT (.MHTML) or .PDF files; pages are saved in just the same format as the target URLs point to.<\/p>\n<p>5. Because you\u2019re using your normal Web browser and downloading only one page\/post at a time, use of DownThemAll! should not trigger any traffic blocking from the targeted forum.<\/p>\n<p>Great, so you have the forum threads downloaded as .HTML files. Of course, there\u2019s a problem. The .HTML pages saved locally do not also save the images. When you load one of these HTML forum pages locally, the Web browser is still loading the post\u2019s images from the online forum server. 
That works for viewing now, but we need a more permanent, self-contained local file.<\/p>\n<p>6. The only solution I found for the next bit is the Pale Moon browser (very worthy, based on Firefox) and its free MozArchiver add-on. This add-on appears to be unique in being happy to save all open tabs (rather than just one). It saves each open tab as a portable .MHT file with embedded images. You will have to be brave, though, and load 50-80 tabs at a time by drag-dropping the .html files onto Pale Moon. On my workstation, Pale Moon has no problem with 80 at a time. After the drag-drop, pause to let the tabs all load, then \u201csave all tabs\u201d to .MHTML files, which is quickly done.<\/p>\n<p>It\u2019s thus relatively easy to use this method to work through 500 or so locally-saved forum post-pages, provided they are not too image-heavy.<\/p>\n<p>When done with each batch in Pale Moon, right-click on the left-most tab and \u201cclose all tabs to the right\u201d. Repeat until finished.<\/p>\n<p>That\u2019s it. A slightly tedious workflow, but your recalcitrant and harvester-phobic user forum is now safely archived as portable .MHT files, one per forum thread. 
Good local indexing\/search software (DocFetcher, dtSearch, etc.) should have no problem indexing local .MHT files, ready for you to run keyword searches across the local archive.<\/p>\n<p>If you ever need to convert the .MHT (.MHTML) files back, the Windows freeware MHTML Converter 1.1 will do that, and has batch processing.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Task: To download and safely archive a useful but very recalcitrant user-forum, one that may be at risk of going &hellip;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/index.php\/2022\/12\/11\/hot-to-archive-a-recalcitrant-forum\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[],"class_list":["post-25591","post","type-post","status-publish","format-standard","hentry","category-jurn-tips-and-tricks"],"_links":{"self":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/25591","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/comments?post=25591"}],"version-history":[{"count":2,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/25591\/revisions"}],"predecessor-version":[{"id":25766,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/25591\/revisions\/25766"}],"wp:attachment":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/media?parent=25591"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/categories?post=25591"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/tags?post=25591"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
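For readers comfortable with a little scripting, steps 2-4 of the post (extract links from the locally saved index pages, keep only those matching the uniform thread-URL pattern, then fetch one thread at a time) can also be done with Python's standard library alone. The sketch below assumes a `saved_index_pages` folder, a `viewtopic.php` thread pattern, and `thread_NNNN.html` output names; all three are illustrative placeholders to be adapted to the actual forum.

```python
# Minimal stdlib-only sketch of steps 2-4: extract links from saved
# index pages, filter for the forum-thread pattern, download politely.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect every href attribute found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Step 2: pull all links out of one saved HTML page."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links


def thread_links(links, pattern="viewtopic.php"):
    """Step 3: keep only thread URLs (dedupe + sort replaces Excel).

    'viewtopic.php' is an assumed pattern; substitute whatever uniform
    path the target forum's thread URLs share."""
    return sorted({url for url in links if pattern in url})


if __name__ == "__main__":
    import glob
    import time
    import urllib.request

    all_links = []
    for path in glob.glob("saved_index_pages/*.html"):
        with open(path, encoding="utf-8", errors="replace") as f:
            all_links.extend(extract_links(f.read()))

    # Step 4: one request at a time, with a pause between requests,
    # to avoid triggering the forum's traffic blocking.
    for i, url in enumerate(thread_links(all_links)):
        req = urllib.request.Request(
            url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        with open(f"thread_{i:04d}.html", "wb") as out:
            out.write(data)
        time.sleep(2)
```

Like DownThemAll!, this saves whatever format the target URLs point to, so the Pale Moon / MozArchiver step for image-embedding .MHT files would still be needed afterwards.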