Bulk PDF download from Archive.org

How to bulk download a disparate set of PDFs from the Internet Archive? The following workflow may be useful for those downloading sets of publications.

If what you want is already in a neat and discrete ‘Collection’ then you’re in luck. There is already a Collection Downloader. Possibly there are others.

If the set is not in a Collection and you don’t want to make one, or if it is otherwise jumbled and problematic, this roundabout solution will work.

1. Install the Web browser add-on Copy Open Tab URLs. This can copy the URLs of all open tabs to a list with ‘one URL per line’.

2. Visit Archive.org and find all your uncollected issues of a journal or ‘zine run and its successors. Quickly right-click/open each one into a new Web browser tab. I am assuming there are perhaps 40-70 scattered issues, not hundreds or thousands. And that your PC memory and Web browser can handle that many open tabs without crashing.

3. Using your new Web browser add-on ‘Copy Open Tab URLs’, instantly capture all the open tabs into a one-URL-per-line list.

4. Paste the resulting list into my URL converter .XLS spreadsheet. This takes advantage of the fixed format that Archive.org page URLs and PDF links have. For instance…

https://archive.org/details/fluffy_kitty_tales

The PDF at this page will almost always be…

https://archive.org/download/fluffy_kitty_tales/fluffy_kitty_tales.pdf

The hidden formulas in the spreadsheet automatically fix the URLs. Copy the final ‘fixed’ list of links from the spreadsheet. Save the list to a plain .TXT file.

I could have done this with a regex, but people are more familiar with the .XLS spreadsheets in Microsoft Office.
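For those who would rather take the regex route, here is a minimal Python sketch of the same conversion. It is not the spreadsheet's formula, just an illustration of the pattern shown above, and it assumes the PDF is named after the item identifier (which, as noted, is usually but not always the case). The function names are my own placeholders.

```python
import re

# Matches an item page URL such as https://archive.org/details/fluffy_kitty_tales
# and captures the identifier that follows /details/.
DETAILS_RE = re.compile(r"^https?://(?:www\.)?archive\.org/details/([^/?#\s]+)")


def details_to_pdf(url):
    """Return the guessed PDF link for an item page URL, or None if it does not match."""
    match = DETAILS_RE.match(url.strip())
    if match is None:
        return None
    identifier = match.group(1)
    # Assumption: the PDF carries the same name as the identifier.
    return f"https://archive.org/download/{identifier}/{identifier}.pdf"


def convert_list(text):
    """Convert a one-URL-per-line block of text, skipping lines that do not match."""
    converted = (details_to_pdf(line) for line in text.splitlines())
    return [url for url in converted if url is not None]
```

Paste the copied tab list into a string (or read it from a .TXT file), pass it to `convert_list`, and write the result out one link per line. Any tab that was not an Archive.org item page is simply dropped.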

5. Now use a simple bit of freeware that will just download a list of files. The well-established free Chrome/Firefox browser add-on DownThemAll! will do the job, with a bit of initial wrangling. First set it to go directly to its Manager when opened.

Then in the Manager right-click somewhere, and “Import from file”. Select your list of PDF links.

OK, you should be done. Start the downloads running, keep your browser open, and go off and do something else, because it’s going to take a long time.

When finished, check the list for any ‘404’ PDFs. There may be a few where the URL failed, and these will need to be manually downloaded from the item page.
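If you are comfortable running a script, the download-and-check step could also be done outside the browser. The following is only a rough sketch using Python's standard library, not a replacement for DownThemAll!. The function names, the `pause` delay and the swappable `fetch` parameter are my own assumptions, there for illustration and testing.

```python
import os
import time
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def filename_for(url):
    """Take the last path segment of the link as the local file name."""
    return os.path.basename(urlsplit(url).path)


def download_all(urls, folder, fetch=urllib.request.urlretrieve, pause=1.0):
    """Download each link into folder and return a list of (url, reason) failures."""
    os.makedirs(folder, exist_ok=True)
    failed = []
    for url in urls:
        target = os.path.join(folder, filename_for(url))
        try:
            fetch(url, target)
        except (urllib.error.URLError, OSError) as error:
            # A '404' arrives here as an HTTPError, which is a kind of URLError.
            failed.append((url, str(error)))
        time.sleep(pause)  # a short pause between requests, to be polite to the server
    return failed
```

Whatever comes back in the returned list is the set of links to revisit by hand, for example items where the PDF is not named after the identifier.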

Ideally, the .torrent file linked on each page would be extracted instead and loaded into your torrent software, with just the .PDF file in each torrent set running and nothing else. But how one would do that in a bulk/automated manner, I don’t know. And it’s possible that the cross-file slaloming of a .torrent means most of the other files also get downloaded anyway, in effect. So maybe straight .PDF links are the best way.

Tentaclii

~ News and scholarship on H.P. Lovecraft (1890–1937)
