How to archive a recalcitrant forum

Task: To download and safely archive a useful but very recalcitrant user-forum, one that may be at risk of going offline.

Roadblocks:

1) The forum archives can only be accessed by drop-downs that require you to input precise from-to dates (see above). Harvesters / bots cannot get past such barriers, and cannot reach the forum’s ‘deep history’ of per-post threads.

2) Even if you had the individual URL of each and every forum thread, only a proper Web browser can get and archive each forum thread URL. Automated harvesters / bots / capture utilities are quickly blocked by the forum’s server.

3) AutoIt or the newer AutoHotkey might be a solution on Windows, by calling Internet Explorer to load the URLs and then save each as a file. But my intensive searches found only arcane code fragments and one code function, nothing complete or even part-way complete.

The following solution thus requires a bit of manual work, though not too much. It is for a relatively small forum or sub-forum, of technical coding advice (in this case Python for 3D software) without a great weight of images being posted. In this case there are 16 master pages of links to some 500 actual forum posts, and each post has user replies appended. Each post displays as a single scrolling page and is not paginated.

Solution:

1. Find the earliest forum thread date, then manually go through the date drop-downs and create a per-year page that shows the links to that year’s forum threads. Save it, and also any continuation pages there may be for that year. Work through the years and, on a long-standing forum or sub-forum, you may perhaps end up with some 15-20 saved HTML pages. It should not take more than a few minutes.

2. Extract a big list of all the links in these locally saved HTML pages. I used Sobolsoft’s ‘Extract Links from Multiple HTML Pages’ Windows utility to do this, but there are other bulk link extractors.

3. Save the extracted one-per-line links list to a .TXT file, copy-paste that list to Excel and sort the list A-Z. From this sorted list you extract just the links that point to the forum threads. They should have a uniform path and pattern, allowing them to be easily identified and extracted. Save the new list to a further .TXT file.

4. Use the free DownThemAll! extension, in a Chrome-based Web browser, to load the new list .TXT (Web browser | start DownThemAll! | right-click anywhere | ‘Import from file’). You may also want to set DownThemAll! to only download one forum thread at a time (Web browser | start DownThemAll! | Cog icon in DownThemAll!’s lower right | Network | Concurrent downloads: 1).

Have DownThemAll! do the downloads. Very regrettably there is no way to have DownThemAll! save the pages from the browser to .MHT (.MHTML) or .PDF files. It can only save in the same format that the target URLs point to.

5. Because you’re using your normal Web browser and only downloading one page/post at a time, use of DownThemAll! should not trigger any traffic blocking from the targeted forum.
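For readers who would rather script steps 2 and 3 than use a link-extractor utility and Excel, here is a minimal sketch using only the Python standard library. It is an alternative to the tools named above, not a description of them. The folder name, the base URL and the "/thread/" marker in the usage comment are all hypothetical placeholders, to be replaced with whatever uniform path your own forum uses for its threads.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text, base_url=""):
    """Return every link in one HTML page, made absolute against base_url."""
    parser = LinkCollector()
    parser.feed(html_text)
    return [urljoin(base_url, link) for link in parser.links]


def thread_links(pages_dir, base_url, thread_marker):
    """Gather links from all saved pages, keep only thread links, de-duplicate, sort A-Z."""
    found = set()
    for page in sorted(Path(pages_dir).glob("*.htm*")):
        text = page.read_text(encoding="utf-8", errors="replace")
        for link in extract_links(text, base_url):
            if thread_marker in link:  # assumed uniform pattern for thread URLs
                found.add(link)
    return sorted(found)


# Example use (all three arguments are placeholders):
# links = thread_links("saved_pages", "https://forum.example.com/", "/thread/")
# Path("thread_links.txt").write_text("\n".join(links) + "\n", encoding="utf-8")
```

The resulting one-per-line .TXT file can then be imported into DownThemAll! as in step 4. The download itself is still left to the browser, since that is what avoids the forum’s blocking.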

Great, so you have the forum threads downloaded as .HTML files. Of course, there’s a problem. The .HTML pages being saved locally are not also saving the images. When you load one of these HTML forum pages locally, the Web browser is still loading the post’s images from the online forum server. That’s good for now, but what we need is a more permanent and self-contained local file.

6. The only solution I found for the next bit is the Pale Moon browser (very worthy, based on Firefox) and its free MozArchiver add-on. This add-on appears to be unique, in terms of being happy to save all open tabs (rather than just one). It saves each open tab as a portable .MHT (.MHTML) file with embedded images. You will have to be brave though, and load 50-80 tabs at a time by drag-dropping the .html files onto Pale Moon. With my RAM and workstation, I find Pale Moon has no problem with 80 at a time. After drag-drop, pause to let the tabs all load. Then “save all tabs” to .MHT files, which is quickly done.

It’s thus relatively easy to use this method to work through 500 or so locally-saved forum post-pages, provided they are not too image-heavy.

Then when done with each batch in Pale Moon, right-click on the left-most tab and “close all tabs to the right”. Repeat until finished.
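To make the batches easier to drag-drop, the downloaded .html files can first be sorted into numbered folders of 80 files each. The following is only a sketch of that idea in Python. The function names and the folder layout (batch_001, batch_002 and so on) are my own invention, and the batch size of 80 is simply the figure that worked on my workstation.

```python
import shutil
from pathlib import Path


def chunked(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def stage_batches(src_dir, dest_dir, size=80):
    """Copy the .htm/.html files in src_dir into numbered batch folders under dest_dir."""
    files = sorted(Path(src_dir).glob("*.htm*"))
    batch_dirs = []
    for number, group in enumerate(chunked(files, size), start=1):
        batch_dir = Path(dest_dir) / f"batch_{number:03d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for source_file in group:
            # Copy rather than move, so the original downloads stay untouched.
            shutil.copy2(source_file, batch_dir / source_file.name)
        batch_dirs.append(batch_dir)
    return batch_dirs
```

Each folder’s contents can then be selected and dropped onto Pale Moon in one go. With some 500 threads and a batch size of 80, that works out at seven batches.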

That’s it. A slightly tedious workflow, but your recalcitrant and harvester-phobic user forum is now safely archived as portable .MHT files, one per forum thread. Good local indexing/search software (DocFetcher, DTSearch etc) should have no problem indexing local .MHT files, ready for you to do keyword searches across the local archive.
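If you have no indexing software to hand, a rough keyword search can also be sketched in Python. This assumes the saved .MHT files are ordinary MIME multipart files, which is the usual layout for MHTML, but I have not verified it against every file MozArchiver produces. The function names are hypothetical, and this is a plain linear scan rather than a proper index.

```python
import email
from html.parser import HTMLParser
from pathlib import Path


class TextOnly(HTMLParser):
    """Collects the text between tags (script and style text is included too)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def mht_text(path):
    """Return the visible text of the first text/html part in an .MHT file, or ''."""
    with open(path, "rb") as handle:
        message = email.message_from_binary_file(handle)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable or base64 transfer encoding.
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                html = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset label, fall back to UTF-8
                html = payload.decode("utf-8", errors="replace")
            parser = TextOnly()
            parser.feed(html)
            return " ".join(parser.chunks)
    return ""


def search_archive(folder, keyword):
    """Return the names of .mht/.mhtml files whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = []
    for path in sorted(Path(folder).glob("*.mht*")):
        if needle in mht_text(path).lower():
            hits.append(path.name)
    return hits
```

For 500 or so files this should be quick enough. For a larger archive, dedicated indexing software will do far better.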

If you ever need to convert the .MHT (.MHTML) files back, the Windows freeware MHTML Converter 1.1 will do that and has batch processing.

News from JURN

~ search tool for open access content
