Ancient Athens

From the blogs. ‘Why I made JURN’, part 199

“…it’s so difficult to find a lot of academic articles online unless you subscribe to a service like JSTOR. It took me a good couple of hours to unearth a research article the other day using Athens/JSTOR (it was so much hassle that it would almost have been easier to go to the library).”

…and this from a savvy Web 2.0 staff-development trainer, blogging from a major British university. How must some of the less able undergraduates fare?

JURN Directory checked and refreshed

I ran the excellent Linkbot v5 link-checking software over the JURN Directory. The results: five dead links removed (along with one formerly open journal that has sold out and now sits behind a paywall at SAGE); 15 moved links relocated and repaired; and a couple of links removed to journals that had only marginal free content.

If you are keeping a local saved copy of the JURN Directory page on your desktop (possibly to avoid the “bouncy puppy” concertina effects, which vanish if the page is saved locally), please refresh it.

Two new full-text research assistance services

A couple of new commercial start-ups in medical/scientific full-text research assistance services, offering to outsource some of the heavy-lifting for librarians — Pubget and the rather clunkily-named Mighty Linkout Machine. Amazingly, given the seemingly enormous resources poured into science journals and elite universities, these services are said to be needed because scientists and doctors are…

“frustrated by the challenge of getting full-text PDF access to science journal articles — even while working inside well-endowed institutions like Harvard and Oxford”

Giving free JSTOR access to alumni

Now here’s a nice move. Southern Illinois University is giving free JSTOR access to its alumni…

“SIU alums can access JSTOR anywhere in the country after registering on the Alumni Association Web site.”

If there were one thing that would get me back in touch with my old alma mater, after having lost touch with the alumni magazine during a few house moves, that would be it.

Workload allowances for journal editing

An interesting point from the publisher of an independent commercial academic journal…

The [ universities and their various Research Assessment Exercises ] have created, and they sustain, an academic assessment system that is very heavily dependent on academic journals, but which gives no credit whatever for the editing of such journals. The universities offer precious little encouragement (read “no material support” and “no workload allowance”) for the editing or publication of academic journals.

Common Tag and Search BOSS

This looks somewhat interesting. Just launched, Common Tag

“is an open tagging format developed to make [ Web ] content more connected, discoverable and engaging. Unlike free-text tags, Common Tags are references to unique, well-defined concepts, complete with metadata and their own URLs.”

From what I read, it sounds a bit like herding cats: attempting to persuade bloggers and social bookmarkers, for a start, to use standardised vocabularies and terminology for content tagging. I suspect it’ll struggle to gain traction, simply due to the sheer size of the Web. Nice logo, though…

It would be interesting to see an academic version, which could auto-read a document and suggest and automatically embed (microformat or RDFa?) tags using the A&AT terms.

And I just found out about Yahoo Search BOSS, which seems to have been around in mature form since late 2008. It’s Yahoo’s competitor to Google CSE. It seems to have appeared during their recent takeover troubles, which doesn’t inspire confidence. However, it’s getting new features and appears to be under active development. New sorting functions have apparently been added to BOSS, offering sorting by date and/or a specified time range (although it seems that may be limited to custom News search?). There’s also a Python-driven mashup feature, although at present people seem to be using this to add rather naff-looking context-aware sidebars alongside search results. There’s also a kicker in the small print…

In the near future, we will be introducing a fee structure for BOSS

If sorting by date were a feature that could be added to Google CSE results, and a keyword-targeted RSS feed were then allowed to run from that sorting, JURN could feed you a usable approximation of a rolling keyword-specific table-of-contents alert from 3,000+ ejournals. Does the current standard open access ejournal publishing software allow that sort of cross-journal alerting service, I wonder?

Getting only the free articles into JURN

Someone asked about what comes into the JURN index when a title is indexed but only offers a limited amount of free full-text or “free-sample” articles. Does the rest of the online material (link-less tables-of-contents, abstracts with no full-text links, etc.) from the journal also enter JURN? The answer is: no, not usually. It’s usually possible to filter at the URL level so that only the free content enters JURN. For example, by only indexing URLs such as:

http://www.journal.com/journal/sample/*.pdf

http://www.journal.edu/journalABC/documents/*.pdf

A real-world example is:

http://www.egyptpro.sci.waseda.ac.jp/pdf*/*/*.pdf

Where “*” is the Google CSE wildcard. Of course if some dimwit IT techie then decides to juggle the directory structure, it will erase the journal from JURN. But that’s a risk any directory or search-engine takes.
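For anyone who wants to test a set of include-patterns before committing them, here is a minimal Python sketch of the idea. It only approximates the wildcard by treating “*” as “any run of characters”, and I can’t promise that Google CSE matches in exactly the same way. The function names and the test URLs are hypothetical, invented for illustration.

```python
import re

# Include patterns copied from the examples above; "*" is the only wildcard.
INCLUDE_PATTERNS = [
    "http://www.journal.com/journal/sample/*.pdf",
    "http://www.egyptpro.sci.waseda.ac.jp/pdf*/*/*.pdf",
]


def pattern_to_regex(pattern):
    """Turn a "*" wildcard pattern into a compiled regular expression.

    Assumption: "*" stands for any run of characters, slashes included.
    Everything else in the pattern is matched literally.
    """
    literal_parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(literal_parts))


def is_indexed(url, patterns=INCLUDE_PATTERNS):
    """Return True if the whole URL matches at least one include pattern."""
    return any(pattern_to_regex(p).fullmatch(url) for p in patterns)


if __name__ == "__main__":
    # Hypothetical URLs, made up to exercise the patterns.
    for url in [
        "http://www.journal.com/journal/sample/article01.pdf",
        "http://www.journal.com/journal/abstracts/article01.html",
    ]:
        print(url, is_indexed(url))
```

Running a site’s list of URLs through something like this, before and after a suspected directory reshuffle, would be one quick way to spot a journal that has silently dropped out of the index.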

Sometimes a few PDFs to do with society or journal administration matters can be called into search along with the articles, if all the PDFs sit indiscriminately in a single URL path. A search for:

site:http://www.scholarly-society-journal.info/ filetype:pdf

… will usually show if there are too many of these. Google tends to bunch that sort of material at the top of site: search results. Usually there are only a dozen or so.

It’s different with the few ejournals that cheekily use standard ‘open access’ publishing software, but which actually keep recent articles locked away behind a one-year or even three-year rolling paywall. The software is not intelligent enough to place paywalled article abstract pages on a different and distinctive URL path, and then to automatically transfer and bounce these when the article becomes free. But indexing only the .pdf path in such cases will usually call only full-text articles into JURN.

Open access search?

Following on from my previous post… a search for “open access” site:www.google.com/coop/ was discouraging. There are about twenty “living-dead” Custom Search Engines from 2006, but no large ones updated after 2006 (so far as I could tell from a quick visit).

Pouring out all this open access content is all very well, but where’s the competition and development in open access search?

And where are the simple common standards for flagging open content for search-engine discovery and sorting, for that matter? Judging by the structure and look of most academic repositories, internet search-engines are the last things on their minds.

Now of course I’m viewing things from the outside, as an independent curator and social entrepreneur, not a librarian or OA evangelist. But it seems to me that burying your PhD thesis deep in a repository cattle-car (seemingly with only a few keywords, an ugly template and an impenetrable URL for company) isn’t serving it or the author very well. Especially in terms of metadata and tagging leading to full-text search discovery. As the authors of “Experiences in Deploying Metadata Analysis Tools for Institutional Repositories” recently wrote in Cataloging & Classification Quarterly (No. 3/4, 2009)…

“Current institutional repository software provides few tools to help metadata librarians understand and analyse their collections.”

Which doesn’t bode well for search-engines aiming to hook into and sort the same metadata. That sort of statement might have been acceptable in 1999, but it’s a damning statement to hear from librarians in 2009. And another paper in the same issue concludes that there is…

“a pressing need for the building of a common data model that is interoperable across digital repositories”.

Now I wouldn’t know a Dublin Core from a Dublin Pint, but how difficult would it have been to build a search-engine friendly tag that allows a repository to tell the world “this is a root free-to-all full-text file” and “you’re not going to get any full-text for this title”? Or to allow the “one-click” filtering out of science and medical-related OA material across search results from a thousand repositories?

This could be done at the URL level. For example by using a standard universal URL structure that could be read by machines and humans alike. For a journal it might run something like:

   www.technology-history.org/journal-issue-004/free-full-text/2009_adams_preindustrial_water_mills.html

Where “preindustrial_water_mills” gives the first three words of the article title.

Without even accessing the document, a human can now glance at the URL in search results and read off:

   Journal name (Technology History)
   Issue number (Number 4)
   It’s from a journal
   It’s free full-text
   The year published (2009)
   The author surname (Adams)
   The first three words of the article title (“preindustrial water mills”)
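A machine could read off the same fields just as easily. Here is a minimal Python sketch, assuming the journal URL layout proposed above. The regular expression and the function name are my own invention for illustration, not any existing standard, and a surname containing an underscore would trip it up.

```python
import re

# Assumed layout, taken from the example journal URL above:
#   host/journal-issue-NNN/access-flag/YEAR_surname_first_three_words.html
URL_PATTERN = re.compile(
    r"^(?P<host>[^/]+)/journal-issue-(?P<issue>\d+)/(?P<access>[^/]+)/"
    r"(?P<year>\d{4})_(?P<author>[^_]+)_(?P<title_words>.+)\.html$"
)


def parse_journal_url(url):
    """Split a URL in the proposed layout into its readable fields.

    Returns a dict of fields, or None if the URL does not fit the layout.
    """
    match = URL_PATTERN.match(url)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "issue": int(match.group("issue")),
        "free_full_text": match.group("access") == "free-full-text",
        "year": int(match.group("year")),
        "author": match.group("author"),
        "title_words": match.group("title_words").split("_"),
    }


if __name__ == "__main__":
    example = (
        "www.technology-history.org/journal-issue-004/free-full-text/"
        "2009_adams_preindustrial_water_mills.html"
    )
    print(parse_journal_url(example))
```

The point is only that nothing beyond the URL itself is needed: no metadata record, no page fetch, and a filter for free full-text becomes a single test on one path segment.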

For a repository it could look something like:

   www.uni.edu/oa-repository/free-full-text/theses/history/history-of-technology/2009_adams_preindustrial_water_mills.html

And with a uniform standard for URL structures, university IT techies would not be allowed to fiddle with the directory structure and thus break the URL. All full-text files in U.S. repositories could then be searched simply by indexing one line:

http://www.*.edu/oa-repository/free-full-text/

Anyway, rant over. I did find a large Google CSE for Economics. Not much use for the arts and humanities you might think, and last updated in 2006, but due to its sheer size (23,613 sites from apparently reputable sources) searches for…

“creative economy” keyword

“creative industries” keyword

“art market” keyword

… all seem to show it still has some use as a discovery tool.

A sea of CSEs

I had a quick look around for other Google Custom Search Engines, via a simple search for:

keyword site:www.google.com/coop/

Living-dead CSEs from circa-2006 litter the results, of course. Probably made in 30 minutes during the first flush of public interest in Google’s new toy, usually indexing fewer than 30 items, and then seemingly forgotten about within 30 days.

I guess that’s one of the main reasons why people don’t seem to hold specialist Google CSEs in high regard. Which probably helps to explain why a search for 2009 site:www.google.com/coop/ seems to show that a mere 39 public CSEs have been either built or updated in the last six months. It seems a shame that the academic community is fiddling with often-unlovable and quickly-stale niche wikis, while such a powerful tool is all but unused except for an occasional private one-site index. It’s not as if CSEs don’t have tools for collaborative index-building and weeding.

With a few months of careful work by a professional or subject-specialist, there’s no reason why a CSE can’t hold its head up alongside funded/commercial services, as I hope I’ve shown with JURN. And if a developer plans ahead and uses some common tools, basic maintenance of a large curated engine, once complete, shouldn’t take more than a couple of days of work per year.

I did find a few CSEs in the humanities still showing some stamina…

Theological journal search (340+ titles inc. findarticles.com, last updated Jan 2009).

Online Biblical Studies journals (123 titles, the titles freely listed, last updated 2008).

Judaic Studies in English (278 sites, last updated Sept 2007).

Alcuin Society (139 sites on bibliophilia and book arts, last updated Oct 2008).

AuseSearch (All open access academic repositories in Australia that are listed in Kennan & Kingsley at Feb 2009).

Film Blogs (139 titles, the titles freely listed, last updated June 2009. Looks like a strong tool for quickly finding genuine reviews from film-buffs, as opposed to marketing pseudo-reviews).

Busador Cultural (a large academic-cultural-arts search-engine for Spanish-language material).

So where might there be scope for a strong new curated CSE, with a nice balance of focus and scope? It might be useful to have an engine for “books still of scholarly worth, and other useful non-fiction” which selects from the ebooks that are flooding out from the out-of-copyright book digitisation projects, indexing the full-text. Books such as Tom Wedgwood, the first photographer and Kitecraft and Kite Tournaments. There has to be a more enticing way to access this stuff than getting your keywords tangled in creaky Victorian potboilers and agricultural pamphlets from 1932, or ploughing through a daily list seemingly endlessly populated by thousands of 1920s pulp novels and Victorian romances. But I’m willing to bet that there’s no flag in the metadata which says “non-fiction / just the cool stuff”, so it might take a lot of work.

Blind Search

The academic blog Walt at Random tries out a new search tool, Blind Search…

“You type in a search. You get back the first 10 results for each of three search engines, displayed in three parallel columns. You click on one of three “vote for this search engine” buttons, based on the column of results that seem to match your query best. Then, and only then, Blind Search shows you the engine used for each column.”

Sure to be a fun ice-breaker in the hotel lobby at the First Conference on Open Access Scholarly Publishing, 14th – 16th Sept 09, Sweden.