AWOL has a fascinating post today, on an attempt to identify which AWOL-linked resources have already been ingested into major long-term Web archives, and which haven’t. As part of that experiment Charles and his helpmate Ryan have offered their readers a nice big cleaned A-Z list of the “52,020 unique URLs” linked from AWOL, which is very good of them. I might clip these URLs back and de-duplicate them, then do a side-by-side sheet with JURN’s own indexing URLs, and thus see what’s missing from JURN. Very little in terms of post-1945 journal articles, I suspect, though there may be some I’ve missed.
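For what it’s worth, that clip-back-and-compare step is only a few lines of Python. A rough sketch, assuming both lists are plain text files with one URL per line (the filenames here are just placeholders, not anything AWOL or JURN actually publishes):

```python
# Rough sketch: clip the AWOL URLs back to their host, de-duplicate,
# and list anything not already matched in JURN's own URL list.
# Both filenames are placeholders; one URL per line is assumed.
from urllib.parse import urlparse

def clip(url):
    """Reduce a full URL to scheme://host so near-duplicates collapse."""
    p = urlparse(url.strip())
    return f"{p.scheme}://{p.netloc}".lower()

with open("awol_urls.txt") as f:
    awol = {clip(u) for u in f if u.strip()}

with open("jurn_urls.txt") as f:
    jurn = {clip(u) for u in f if u.strip()}

missing = sorted(awol - jurn)
print(f"{len(missing)} AWOL hosts not matched in JURN")
for host in missing:
    print(host)
```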

Of course, a JURN Search already runs across the AWOL pages, as well as a great many of the post-war full-text originals (via Google). But if I were an Ancient History scholar I might now be tempted to get together with others to crowdfund a mass download of AWOL’s full-text, so that I could search across it locally and minutely, without having to rely on Google etc. I reckon the entire set of AWOL full-text would fit on a 1.5TB external drive, and would cost around $10,000 to harvest by hand and eye. Why would that be needed? I’m assuming that many long-term Web archives are ‘dark’, or that license complications mean no single archive can ingest the entirety of what AWOL points to.

My calculations for the $10k figure start with the fact that a little over 10,000 of AWOL’s 52,020 URLs are straight-to-PDF links, and so very easily downloaded by a harvesting bot. Assuming an average of 5MB per PDF, that means roughly 50GB of disk space for those PDFs.
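That bot pass really would be trivial. A minimal, politely rate-limited sketch, assuming the straight-to-PDF links have first been pulled out into their own list (the filename and the two-second delay are my own placeholders):

```python
# Minimal, polite bulk-download sketch for the straight-to-PDF links.
# "pdf_urls.txt" is a placeholder for a one-URL-per-line list.
import os
import time
import hashlib
import requests

os.makedirs("pdfs", exist_ok=True)

with open("pdf_urls.txt") as f:
    urls = [u.strip() for u in f if u.strip()]

for url in urls:
    # Name each file by a hash of its URL to avoid collisions.
    name = hashlib.sha1(url.encode()).hexdigest() + ".pdf"
    path = os.path.join("pdfs", name)
    if os.path.exists(path):
        continue  # already harvested on an earlier run
    try:
        r = requests.get(url, timeout=60)
        r.raise_for_status()
        with open(path, "wb") as out:
            out.write(r.content)
    except requests.RequestException as e:
        print(f"failed: {url} ({e})")
    time.sleep(2)  # be gentle with the host
```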

If one then assumes that perhaps another 10,000 of the URLs do not point to articles at all (but rather to things such as zoomable, frame-nested scans of original manuscripts and old books, or to huge datasets, all of which are difficult to extract and archive), that leaves roughly 32,000 URLs which are most likely links either to journal TOC pages or to individual articles.

Let’s assume that each of those 32,000 TOC-page URLs leads to an average of 16 articles and reviews (though some 2,000 may be home-page links sitting above links to issue TOCs). So 32,000 × 16 = 512,000 articles of some kind, in PDF or HTML, weighing an average of 1.5MB each. That’s 768GB in total. In that case one might easily store all the AWOL-discovered full-text on an $80 1.5TB external disk, with space to spare for the desktop indexing software’s own index, which would be fairly big. That is a product I might find very useful, if I were an Ancient History student, specialist, or independent scholar without access to university databases.
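Laid out as a back-of-envelope script, with every per-item size being the guess given above rather than a measured figure:

```python
# Back-of-envelope storage estimate, using the guessed averages above.
pdf_links = 10_000          # straight-to-PDF links
avg_pdf_mb = 5              # assumed average PDF size
toc_links = 32_000          # URLs assumed to be TOC pages or articles
articles_per_toc = 16       # assumed average articles per TOC page
avg_article_mb = 1.5        # assumed average article size

pdf_gb = pdf_links * avg_pdf_mb / 1000
articles = toc_links * articles_per_toc
article_gb = articles * avg_article_mb / 1000

print(f"direct PDFs: ~{pdf_gb:.0f} GB")                   # ~50 GB
print(f"articles: {articles:,} at ~{article_gb:.0f} GB")  # 512,000 at ~768 GB
print(f"total: ~{pdf_gb + article_gb:.0f} GB")            # comfortably under 1.5 TB
```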

But how to harvest those 512,000 articles? The brute-force way would be to parcel up the 32,000 URLs into batches of 150 each, which gives roughly 214 parcels. If one were paying Indian freelancers around 25 cents per URL to go in and grab whatever articles hang off each of those 150 page URLs, plus the page itself, that would cost about $37.50 per parcel. Call it $40, with a small quality bonus. Allowing a minute or two per URL, a parcel of 150 takes about four hours to work through without missing anything, so that’s roughly $10 (US) an hour, which is pretty good for a freelancer with broadband; I don’t think anyone would be being exploited on that deal. The whole 32,000-URL set would then cost around $8,600 to harvest by hand and eye, which seems well within the range of a small crowdfunding campaign.
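And the same sort of scribble for the labour side, with every rate being the assumption stated above rather than a quoted price:

```python
# Back-of-envelope cost estimate for hand/eye harvesting of the TOC URLs.
import math

toc_urls = 32_000
parcel_size = 150
pay_per_parcel = 40        # dollars, including a small quality bonus
hours_per_parcel = 4       # assumed time to work through 150 URLs

parcels = math.ceil(toc_urls / parcel_size)    # ~214 parcels
total_cost = parcels * pay_per_parcel          # ~$8,560
hourly = pay_per_parcel / hours_per_parcel     # $10 per hour

print(f"{parcels} parcels, about ${total_cost:,} total, ${hourly:.0f}/hour")
```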

Of course, it might be that the articles could be wholly or partly harvested by bot. But I suspect that a simple “page + anything it links to” harvest would bring in a lot of chaff alongside the articles, given the very varied and non-standard nature of what AWOL links to. Perhaps that wouldn’t matter in practice, when keyword searching across the entire harvest. Or one might be able to use a more intelligent bot, one using Google Scholar-like article-detection algorithms.
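To show just how blunt a “page + anything it links to” harvest is, here is a deliberately naive sketch: it saves the page and chases only links ending in .pdf, which is exactly why it would miss HTML-only articles on some sites and drag in chaff on others. The URL in the usage stub is a placeholder.

```python
# Naive "page + anything it links to" harvest of a single TOC-style page.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def harvest_page(url, out_dir="harvest"):
    os.makedirs(out_dir, exist_ok=True)
    page = requests.get(url, timeout=60)
    page.raise_for_status()
    # Keep the page itself, for context and for later keyword searching.
    with open(os.path.join(out_dir, "page.html"), "w", encoding="utf-8") as f:
        f.write(page.text)

    soup = BeautifulSoup(page.text, "html.parser")
    for i, a in enumerate(soup.find_all("a", href=True)):
        target = urljoin(url, a["href"])
        if not target.lower().endswith(".pdf"):
            continue  # only chase obvious article files
        try:
            r = requests.get(target, timeout=60)
            r.raise_for_status()
            with open(os.path.join(out_dir, f"item_{i}.pdf"), "wb") as f:
                f.write(r.content)
        except requests.RequestException as e:
            print(f"failed: {target} ({e})")

if __name__ == "__main__":
    # Placeholder URL: substitute a real TOC page from the AWOL list.
    harvest_page("https://example.org/journal/vol-1/")
```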