{"id":50084,"date":"2021-09-26T07:05:39","date_gmt":"2021-09-26T04:05:39","guid":{"rendered":"https:\/\/tentaclii.wordpress.com\/?p=50084"},"modified":"2021-09-26T07:05:39","modified_gmt":"2021-09-26T04:05:39","slug":"bulk-pdf-download-from-archive-org","status":"publish","type":"post","link":"https:\/\/jurn.link\/tentaclii\/index.php\/2021\/09\/26\/bulk-pdf-download-from-archive-org\/","title":{"rendered":"Bulk PDF download from Archive.org"},"content":{"rendered":"<p>How to bulk download a disparate set of PDFs from the Internet Archive? The following workflow may be useful for those downloading sets of publications.<\/p>\n<p>If what you want is already in a neat and discrete &#8216;Collection&#8217; then you&#8217;re in luck. There is already a <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/archive-downloader\/elhoagejfapekjaefenmngphliikoace?hl=en\">Collection Downloader<\/a>. Possibly there are others.<\/p>\n<p>If the set is not in a Collection and you don&#8217;t want to make one, or it is otherwise jumbled and problematic, this roundabout solution will work.<\/p>\n<p><strong>1.<\/strong> Install the Web browser add-on <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/copy-open-tab-urls\/gpmnhkeajnnnkjkgcopciocmdcdkapbh\">Copy Open Tab URLs<\/a>. This can copy the URLs of all open tabs to a list with &#8216;one URL per line&#8217;.<\/p>\n<p><strong>2.<\/strong> Visit Archive.org and find all your uncollected issues of a journal or &#8216;zine run and its successors. Quickly right-click\/open each one into a new Web browser tab. I am assuming there are perhaps 40-70 scattered issues, not hundreds or thousands, and that your PC memory and Web browser can handle that many open tabs without crashing.<\/p>\n<p><strong>3.<\/strong> Using your new Web browser add-on &#8216;Copy Open Tab URLs&#8217;, instantly capture all the open tabs into a one-URL-per-line list.  
<\/p>\n<p><strong>4.<\/strong> Paste the resulting list into <a href=\"https:\/\/www.jurn.link\/tentaclii\/oldimages\/archive_org_pdf_getter.xlsx\">my URL converter .XLS spreadsheet<\/a>. This takes advantage of the fixed format that Archive.org page URLs and PDF links have. For instance&#8230;<\/p>\n<p><em>https:\/\/archive.org\/ details\/fluffy_kitty_tales<\/em><\/p>\n<p>The PDF at this page will almost always be&#8230;<\/p>\n<p><em>https:\/\/archive.org\/ <strong>download<\/strong>\/fluffy_kitty_tales\/<strong>fluffy_kitty_tales.pdf<\/strong><\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.jurn.link\/tentaclii\/oldimages\/tales-1024x47.jpg\" alt=\"\" width=\"529\" height=\"24\" class=\"alignnone size-large wp-image-50535\" \/><\/p>\n<p>As you can see in the above spreadsheet, the hidden formulas in the spreadsheet automatically fix the URLs. Copy the final &#8216;fixed&#8217; list of links from the spreadsheet. Save the list to a plain .TXT file.<\/p>\n<p>I could have done this with a regex, but people are more familiar with the .XLS spreadsheets in Microsoft Office.<\/p>\n<p><strong>5.<\/strong> Now use a simple bit of freeware that will just download a list of files. The well-established free Chrome\/Firefox browser add-on <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/downthemall\/nljkibfhlpcnanjgbnlnbjecgicbjkge?hl=en\">DownThemAll!<\/a> will do the job, with a bit of initial wrangling. First set it to go directly to its Manager when opened.<\/p>\n<p>Then in the Manager right-click somewhere, and choose &#8220;Import from file&#8221;.  Select your list of PDF links.<\/p>\n<p>Ok, you should be done. Start the downloads running, keep your browser open, and go off and do something else. Because it&#8217;s going to take a <em>long<\/em> time.<\/p>\n<p>When finished, check the list for any &#8216;404&#8217; PDFs. 
There may be a few where the URL failed, and they will need to be manually downloaded from the page.<\/p>\n<hr>\n<p>Ideally, the .torrent file linked in each tab would be extracted instead, loaded up to your torrent software, and then <em>just the .PDF file<\/em> in each torrent set running and nothing else.  But how one would do that in a bulk\/automated manner, I don&#8217;t know.  And it&#8217;s possible that the cross-file slaloming of a .torrent means most of the other files also get downloaded anyway, in effect. So maybe straight .PDF links are the best way.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to bulk download a disparate set of PDFs from the Internet Archive? The following workflow may be useful for &hellip;<\/p>\n<p><a href=\"https:\/\/jurn.link\/tentaclii\/index.php\/2021\/09\/26\/bulk-pdf-download-from-archive-org\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[27],"tags":[],"class_list":["post-50084","post","type-post","status-publish","format-standard","hentry","category-unnamable"],"_links":{"self":[{"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/posts\/50084","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/comments?post=50084"}],"version-history":[{"count":0,"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/posts\/50084\/revisions"}],"wp:attachment":[{"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/med
ia?parent=50084"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/categories?post=50084"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jurn.link\/tentaclii\/index.php\/wp-json\/wp\/v2\/tags?post=50084"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}