News from JURN

~ search tool for open access content

Monthly Archives: August 2015

AWOL releases cleaned A-Z URL list.

18 Tuesday Aug 2015

Posted by futurilla in Economics of Open Access, My general observations, Spotted in the news

AWOL has a fascinating post today. It’s on the attempts to identify which AWOL-linked resources have already been ingested into major long-term Web archives, and which haven’t. As part of that experiment Charles and his helpmate Ryan have offered their readers a nice big cleaned A-Z list of the “52,020 unique URLs” linked from AWOL, which is very good of them. I might clip these URLs back and de-duplicate, and then do a side-by-side sheet with JURN’s own indexing URLs and thus see what’s missing from JURN. Very little in terms of post-1945 journal articles, I suspect, though there may be some I’ve missed.
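The clip, de-duplicate and compare step could be sketched roughly as below. This is only an illustration: the function names `clip` and `missing_from_jurn` are made up for the example, the URLs are assumed to carry a scheme (http:// or https://), and "clipping" is assumed to mean reducing each URL to its host (minus any `www.`) plus path.

```python
from urllib.parse import urlsplit


def clip(url: str) -> str:
    """Reduce a URL to a comparison key: lower-cased host without 'www.', plus path."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop any trailing slash so /journal and /journal/ compare as equal.
    return host + parts.path.rstrip("/")


def missing_from_jurn(awol_urls, jurn_urls):
    """Return de-duplicated AWOL URLs whose clipped form is not in the JURN list."""
    jurn_keys = {clip(u) for u in jurn_urls}
    seen = set()
    missing = []
    for url in awol_urls:
        key = clip(url)
        if key in seen:
            continue  # duplicate of an AWOL URL already handled
        seen.add(key)
        if key not in jurn_keys:
            missing.append(url)
    return missing
```

A real comparison would probably need looser matching, since JURN's indexing URLs may be path prefixes rather than exact page addresses.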

Of course a JURN Search already runs across the AWOL pages, as well as a great many of the post-war full-text originals (via Google). But if I were an Ancient History scholar I might now be tempted to get together with others to crowdfund a mass download of AWOL’s full-text, so that I could search across the full-text locally and minutely, without having to rely on Google etc. I reckon the entire set of AWOL full-text would fit on a 1.5Tb external drive and would cost around $10,000 to harvest by hand/eye. Why would that be needed? I’m assuming that many long-term Web archives are ‘dark’ or that license complications mean no single archive can ingest the entirety of what AWOL points to.

My calculations for the $10k figure start with the fact that a little over 10,000 of AWOL’s 52,020 URLs are straight-to-PDF links, and so very easily downloaded by a harvesting bot. Assuming an average of 5Mb per PDF, that means about 50Gb of disk storage space for those PDFs.

If one then assumes that perhaps 10,000 of the URLs do not lead to articles (but rather to things that are difficult to extract and archive, such as huge datasets, or sites showing scans of original source manuscripts and old books in zoomable and frame-nested forms), then that might leave 32,000 URLs that are most likely links to either journal TOC pages or individual articles.

Let’s assume that each of the 32,000 TOC page URLs leads to an average of 16 articles and reviews (though some 2,000 may be home-page links sitting above links to issue TOCs). So 32,000 x 16 = 512,000 articles of some kind, in PDF or HTML, on average weighing 1.5Mb each. So that’s 768Gb in total. In that case one might easily store all the AWOL-discovered full-text on an $80 1.5Tb external disk, and have space to spare for the desktop indexing software’s own index, which would be fairly big. That is a product that I might find very useful, if I were an Ancient History student, specialist, or independent scholar without access to university databases.
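As a rough check, the storage estimate can be recomputed from the stated inputs: roughly 10,000 direct PDFs at 5Mb each, and 32,000 TOC URLs leading to 16 items each at 1.5Mb. The sketch below assumes 1Gb = 1,000Mb, and its variable names are illustrative only.

```python
MB_PER_GB = 1000  # assumption: decimal units, as used on disk labels


def storage_gb(count: int, avg_mb: float) -> float:
    """Storage needed for `count` files of `avg_mb` megabytes each, in gigabytes."""
    return count * avg_mb / MB_PER_GB


direct_pdfs = 10_000       # straight-to-PDF links
toc_urls = 32_000          # URLs assumed to be TOC pages or articles
articles = toc_urls * 16   # average of 16 articles and reviews per URL

pdf_gb = storage_gb(direct_pdfs, 5)      # direct PDFs at 5Mb each
article_gb = storage_gb(articles, 1.5)   # harvested articles at 1.5Mb each
total_gb = pdf_gb + article_gb

fits_on_disk = total_gb < 1500           # a 1.5Tb external drive
```

On these inputs the direct PDFs and the harvested articles together come to a little over 800Gb, which leaves several hundred Gb on a 1.5Tb drive for a search index.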

But how to harvest those 512,000 articles? The brute force way would be to parcel up the 32,000 URLs into parcels of 150 each. That’s about 214 parcels x 150 URLs. If one were paying 20 cents per URL to Indian freelancers, to go in and grab whatever articles are hanging off each of those 150 page URLs, plus the page, then that would cost $30 per parcel. Let’s say $40, with a quality bonus. Let’s say it takes four hours to do the 150 URLs and not miss anything, which is a little over a minute and a half per URL. So that’s $10 U.S. an hour, which is pretty good for an Indian freelancer with broadband, and I don’t think anyone would be being exploited on that deal. So the whole 32,000 URL set would cost about $8,560 to harvest by hand and eye, which is under the $10,000 estimate and seems well within the range of a small crowdfunding campaign.
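The parcel arithmetic is easy to rerun with different assumptions. The sketch below takes the stated inputs (32,000 URLs, parcels of 150, $40 paid per parcel, four hours of work per parcel) and rounds the parcel count up so that no URLs are left over. The function name `harvest_cost` is made up for the example.

```python
import math


def harvest_cost(urls: int, parcel_size: int, pay_per_parcel: float):
    """Return (number of parcels, total cost), rounding the parcel count up."""
    parcels = math.ceil(urls / parcel_size)
    return parcels, parcels * pay_per_parcel


parcels, total = harvest_cost(32_000, 150, 40)

hours_per_parcel = 4
hourly_rate = 40 / hours_per_parcel              # dollars per hour for the freelancer
minutes_per_url = hours_per_parcel * 60 / 150    # time available per URL
```

With these inputs the count comes out at 214 parcels and the total at $8,560. Changing the parcel size or the payment per parcel shifts the total, so the figures are best read as an order-of-magnitude estimate.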

Of course, it might be that the articles could be wholly or partly harvested by bot. But I suspect that a simple “page + anything it links to” harvest would bring in a lot of chaff alongside the articles, given the very varied and non-standard nature of what AWOL links to. Perhaps that wouldn’t matter in practice, when keyword searching across the entire harvest. Or one might be able to use a more intelligent bot, one using Google Scholar-like article-detection algorithms.
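A very simple way to cut down the chaff in a “page + anything it links to” harvest might look like the sketch below. It is nowhere near Google Scholar-like detection: it keeps only same-host links whose path ends in `.pdf` or contains a hint string. The hint `/article/` is a guess for illustration, not something taken from AWOL's data, and the class and function names are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

ARTICLE_HINTS = ("/article/",)  # hypothetical path fragments suggesting an article page


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def likely_article_links(html: str, page_url: str):
    """Return absolute same-host links that look like articles, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).netloc
    keep = []
    for href in collector.hrefs:
        full = urljoin(page_url, href)   # resolve relative links against the page
        parts = urlsplit(full)
        if parts.netloc != host:
            continue                     # skip off-site links (ads, social buttons)
        path = parts.path.lower()
        looks_like_article = path.endswith(".pdf") or any(h in path for h in ARTICLE_HINTS)
        if looks_like_article and full not in keep:
            keep.append(full)
    return keep
```

The same-host rule would wrongly drop journals that store their PDFs on a separate server, so a real harvest would need a per-site allowance for that.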

Element Hiding Helper updates, changes

16 Sunday Aug 2015

Posted by futurilla in JURN tips and tricks, Spotted in the news

AdBlock Plus’s Element Hiding Helper has updated. It no longer resides on the right-click mouse menu. You need to enable the top menu bar button for AdBlock (View|Toolbars|Customise), then it launches from a drop-down from that icon.

The new method of selecting a block to hide takes a minute of getting used to. If you can comprehend nested HTML code at a glance then it’s not necessarily easier than before, since it’s now trickier to identify the master container DIV for the whole block you want to hide. However, other users will probably find it a bit easier and more visual to use.

Element Hiding Helper is useful for customising “noisy” websites such as newspaper front pages, which blast you with celebrity news sidebars, scrolling tickers, sports sections and other regular items you never read.

JURN’s eco-titles directory page

14 Friday Aug 2015

Posted by futurilla in My general observations

Checked and repaired the linkrot on the 400+ URLs in the preliminary directory of ecology/nature related titles indexed by JURN. Revised the corresponding indexing URLs in the main JURN database, if needed.

Google experiment indexes images from PDFs

11 Tuesday Aug 2015

Posted by futurilla in How to improve academic search

Google indexes images from PDF files. Fairly limited at present, possibly because the pictures all seem to be drawn from a small set of 500 PDFs stored at http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/. But I’d guess that, as Google’s machine-learning algorithms get tuned up on this, we may start to see the service expanded to extract and serve images from wild PDFs. I wonder if there will be a Creative Commons filter for images from open access research PDFs? I also wonder if this may enhance the size of the image pool accessible via JURN’s new Image Search feature?

[ Hat-tip: ResearchBuzz ]

Xenu’s Link Sleuth as a desktop replacement for LinkBot Pro

10 Monday Aug 2015

Posted by futurilla in JURN tips and tricks

Xenu’s Link Sleuth is a free desktop Windows linkbot (Web link checker). It’s quite possibly the only one now available, at least for those who prefer not to punt their URLs into a Web-based service. Xenu’s Link Sleuth has been out of development since September 2010, but is still more up to date than the old Linkbot Pro 5.x (which needs to be run in Windows XP compatibility mode on newer versions of Windows). Xenu LS looks very similar to Linkbot and it works in much the same manner on Windows 8.x. It has about the same speed, maybe a little quicker, but that may be because it is more impatient when waiting for timeouts.

Xenu LS can “treat redirections as errors”, often very useful for detecting moved pages, but this feature may need to be enabled in the Advanced panel. It’s not as useful as Linkbot in this regard, because it doesn’t show you the new and old URLs side-by-side, just the original URL and an “object permanently moved” or “object temporarily moved” flag. This makes it harder to detect if an OJS journal installation, for instance, is just passing a visitor over to the “current issue” page of a journal or is sending visitors to a more anomalous URL.
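For anyone scripting their own check, a side-by-side row of the kind described above could be built as in the sketch below. It is not how Xenu or Linkbot work internally. It assumes the HTTP status code and the redirect target have already been fetched by some other means, and that an OJS “current issue” page has a path ending in `/issue/current`.

```python
from typing import NamedTuple

PERMANENT = {301, 308}        # HTTP status codes for a permanent move
TEMPORARY = {302, 303, 307}   # HTTP status codes for a temporary move


class RedirectRow(NamedTuple):
    old_url: str
    new_url: str
    kind: str
    to_current_issue: bool


def redirect_row(status: int, old_url: str, new_url: str) -> RedirectRow:
    """Pair the old and new URLs and flag redirects to an OJS-style current issue."""
    if status in PERMANENT:
        kind = "permanently moved"
    elif status in TEMPORARY:
        kind = "temporarily moved"
    else:
        kind = "other"
    # Assumption: OJS current-issue pages end in /issue/current.
    to_current = new_url.rstrip("/").endswith("/issue/current")
    return RedirectRow(old_url, new_url, kind, to_current)
```

Rows where `to_current_issue` is false are the ones worth checking by eye, since those redirects point somewhere less predictable.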

Sadly some large sites, such as Hathi, block being visited by a Xenu installation. Presumably this is because they see it as a species of URL harvester. The way that Xenu identifies itself to servers cannot be spoofed, unfortunately. Thankfully the number of such URLs seems to be very small (Hathi, CIA, uMich.edu, for me).

Update: LinkChecker, a free open-source desktop link checker for Windows.

Update: Screaming Frog SEO Spider as a Linkbot replacement. As of 2021 it has a free version which will check up to 500 links at a time. Far more advanced than LinkChecker.

My new Microsoft Publisher magazine template

08 Saturday Aug 2015

Posted by futurilla in My general observations

I’m pleased to announce that the JURN blog now has a new page offering various ways to support JURN. The new page will host a (hopefully growing) range of digital download items that will help support JURN in terms of my time, the cost of hosting and general digital shoe-leather. I doubt I’ll sell more than a dozen or so downloads a year, but even that would help support JURN — and a few more sales beyond that might give me funds for marketing or to attend a UK open access conference.

So, first up is my $23 Microsoft Publisher magazine template. This is a substantial 28-page royalty-free template for the popular Microsoft Office Publisher 2013 (and higher) software. It has been designed by me in the style of a quality “small town” quarterly magazine, one aimed specifically at those wanting to sustain and revive a small American town or neighbourhood. But it can also be easily adapted to suit your own special interest, business sector, university or location. Just drop in your own pictures, and paste in new texts.

You can help support JURN by posting this news to Facebook or Twitter, or by suggesting the template to anyone who uses MS Publisher.

The images below show some sample page spreads from my new template…

[ Images: nine sample page spreads ]

(Cover photography by Colin Garrow, all other pictures are Wikipedia or CC0)

Get it here.

QuiteRSS

08 Saturday Aug 2015

Posted by futurilla in JURN tips and tricks

Is your 2013 free FeedDemon Pro 4.5 becoming annoying, in terms of its occasional ’15 second freeze’ problem on Windows? It’s nice software but is no longer under development. So I’ve taken another look around for alternative desktop RSS readers that are under active development, seeking something a touch faster but with the same or better features.

One ‘actively developed’ option is the new-ish QuiteRSS. A basic feature set, so far, but perfectly functional. AdBlock runs by default in the internal browser, and Flash is blocked with a click-to-play button. After install, for additional security you may want to uncheck: Tools | Options | ‘Help improve QuiteRSS by sending usage information’ and disable ‘Javascript’ and ‘External plugins’ in the internal browser. You can also block internal browser pages from setting cookies.

QuiteRSS offers search within your feeds, though only at the level of a per-folder search…

It also offers control of font size and font choice, across all display panes. So there is now one acceptable actively developed desktop Windows alternative to FeedDemon Pro. Which is good to know.

The only other — albeit unacceptable — actively developed desktop option is RSSOwl 2.2.1. Update, 2025: last version was 2014, has several vulnerabilities and the maker says “don’t use it”.

Guide to free academic search – links repaired

06 Thursday Aug 2015

Posted by futurilla in My general observations

JURN’s “A short guide to free academic search” guidance page has been link-checked and repaired.

Repozitar

06 Thursday Aug 2015

Posted by futurilla in How to improve academic search, My general observations

Repozitar is a unified search tool for Czech open repositories. By default their keyword search only returns records which offer full-text. A nice touch, and it makes one wonder why the English-speaking world’s repository search tools seem to have such trouble offering this simple useful feature.

Repozitar is associated with a searchable nationwide registry of Czech theses, seemingly part of a Masaryk University project to help detect plagiarism in theses and student papers. English abstracts appear to be common in the very detailed record pages.

Retraction Watch boosted by $400,000 grant

04 Tuesday Aug 2015

Posted by futurilla in Spotted in the news

Retraction Watch has been given a $400,000 grant from the John D. and Catherine T. MacArthur Foundation, “to create a comprehensive database of retractions, allowing us to hire our first staff writer”.

Depending on the form it takes this could potentially be indexed by JURN? It would have to be one retraction, one page, and have the OA status indicated in the URL path — www.database.fuz/articles/oa/article725.html
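To illustrate why the URL path matters, a filter that accepts only pages with an `oa` path segment might be sketched as below. The function name `is_oa_path` and the choice of `oa` as the marker are assumptions drawn from the made-up example URL above, not from any real Retraction Watch database.

```python
from urllib.parse import urlsplit


def is_oa_path(url: str, marker: str = "oa") -> bool:
    """True if one of the URL's path segments is exactly the OA marker."""
    if "://" not in url:
        url = "http://" + url   # urlsplit needs a scheme to find the host
    segments = urlsplit(url).path.split("/")
    return marker in segments
```

Matching on a whole path segment, rather than on the letters anywhere in the URL, avoids false hits on words such as “load” or “board”.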
