News from JURN

~ search tool for open access content

Category Archives: How to improve academic search

‘Search Query Ambiguity’

14 Tuesday Jul 2009

Posted by futurilla in How to improve academic search

An interesting-sounding new book suggests new ways of enabling better search-engine experiences, by presenting search-results differently according to the ambiguity of the search (i.e., show the results differently depending on what type of dummy the user seems to be). Search Query Ambiguity (June 2009) looks at how…

“Web search-engines currently do not guide users to construct less ambiguous (i.e., better) search queries, and do not sort results [ usefully ]. […] This book provides new methods of presenting and sorting search results based on search query ambiguity, without resorting to slow-loading and white-spaced-filled graphical methods […] three methods of information visualization and of sorting results are analysed in the environments of both single-term and multi-term search queries”

Although, as I wrote recently on this blog, this may be thinking about things the wrong way round — and may also not be practical due to the strain on back-end computational resources at the Google server farms.

We might instead use browser-embedded individual ‘search-profiles’ to silently shape the search terms and modifiers on-the-fly, in the browser, before they even hit the engine.

An interview with your browser

28 Sunday Jun 2009

Posted by futurilla in How to improve academic search, My general observations

With the release of the supposedly whippet-fast Firefox 3.5 just two days away, I’m wondering why browsers don’t do a short ‘search profile interview’ when they install. Rather like online dating ‘interviews’, I suppose, but with Google as the object of your affection rather than a Gordon/Gloria.

Then, on certain types of searches (i.e.: the vague ones) your browser would ping Google your carefully-considered ‘search profile’, and presto! — better search-results.

For example, an art historian doing a vague search for samuel palmer shoreham would never have to see results from dodgy poster websites, because the browser profile would say “my user is interested in art history and books and articles containing references”, and Google might also say “samuel palmer was a notable artist whose work is out-of-copyright”, and thus the modifiers -posters -framing -delivery would automatically be added to such a search, and pages with proper academic references would get a boost in the results.

Whereas the person whose browser profile said “frequently spends on home furnishings, subscribes to Homes & Gardens” will get the poster and prints websites pushed to the top, and the 50,000-word thesis on Christian visionary symbolism pushed to the bottom.

Yes, you could have removed those results manually (*) if you’re logged into Google, but you can only do that after the search. And most ‘vague’ searches are one-offs, the kind of query you rarely repeat.

( * I had about four poster-sales sites in the first two pages of that search, yet I’m logged into Google and have been searching for academic stuff for months. Google seems to have learned little about what I want.)

Privacy issues? Well, yes. But what if the browser could seamlessly re-configure a user’s vague search terms, based on their personal profile and known interests, before the query is sent to the engine? Think “search suggestions” on steroids, and without any annoyingly dumb flickery drop-down boxes that don’t have a clue about my interests.
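
As a rough illustration of that browser-side rewrite, here is a minimal Python sketch. The profile structure, the 'art-history' label and the function name are all made up for the example; only the three modifiers come from the Samuel Palmer search above.

```python
# Hypothetical search profile: maps a declared interest to the
# exclusion terms it should add to a vague query. The structure
# and names here are illustrative assumptions only.
PROFILE_EXCLUSIONS = {
    "art-history": ["posters", "framing", "delivery"],
}


def rewrite_query(query: str, interests: list[str]) -> str:
    """Append -term modifiers for each interest, skipping duplicates."""
    words = query.split()
    already = {w.lower() for w in words}
    for interest in interests:
        for term in PROFILE_EXCLUSIONS.get(interest, []):
            modifier = "-" + term
            if modifier not in already:
                words.append(modifier)
                already.add(modifier)
    return " ".join(words)


# The rewritten query is what would be sent on to the engine.
print(rewrite_query("samuel palmer shoreham", ["art-history"]))
```

Nothing leaves the browser except the rewritten query, so the profile itself is never sent to the engine.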

URLs to icons

21 Sunday Jun 2009

Posted by futurilla in How to improve academic search, My general observations

If there was a Firefox addon that converted URLs to icons…

[Image: url-to-icon]

There isn’t, sadly. 🙂 < of course, as you can see here, WordPress already has similar functionality — replacing a text-smiley with an icon.

OutWit Docs

20 Saturday Jun 2009

Posted by futurilla in Academic search, How to improve academic search, JURN tips and tricks

Have you ever wanted to rip all the PDF and DOC files from a focussed Google or Google Scholar search, quickly save them all to a folder, index them with something powerful like dtSearch, and then search the real full-text from across all of them — rather than whatever bits the Googlebot indexed as it swept past, and whatever bits the Google Search ran its search from?

Or archive the entire run of a PDF ejournal that’s sitting at site:www.our-ejournal/articles/ ?

The new free OutWit Docs Firefox plugin does that, and works with the latest version of Firefox. There’s one major drawback — it hijacks the space right next to your browser’s Home icon, with a naff shiny 3D-stylee icon…

[Image: ugly]

Unacceptable. It can however be moved after a bit of fiddling (Right-click, ‘customize’, and drag it out) and then placed somewhere a little more suitable and out-of-sight.

When using it, though, you also quickly come to appreciate why people should name their academic PDF files something_meaningful.pdf rather than xy2f6fjg00.pdf. And why filenames should have year rather than month first…

[Image: pdfnames]
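
The point about year-first names is easy to see with a plain alphabetical sort, which is roughly what a folder listing gives you. A tiny Python sketch, using made-up filenames:

```python
# Made-up filenames for illustration: the same three issues named
# month-first and year-first.
month_first = ["03_1915_issue.pdf", "11_1911_issue.pdf", "01_2007_issue.pdf"]
year_first = ["1915_03_issue.pdf", "1911_11_issue.pdf", "2007_01_issue.pdf"]

# A plain string sort, with no knowledge of dates.
print(sorted(month_first))
print(sorted(year_first))
```

Only the year-first list comes out in date order; the month-first list puts 2007 ahead of 1915 and 1911.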

As a severe test of what after all is a mere 0.1.0.20 app, it took 9 minutes to whisk through 90 years’ worth of Field Artillery journal (1911-2007), running from a Google search of site:sill-www.army.mil/FAMAG/ , to find 800MB in 996 PDF files, and to then start to download them. This was, of course, the point at which I wanted OutWit to have a big red STOP button, although quitting the app did the trick.

Seven things any new ejournal should consider

16 Tuesday Jun 2009

Posted by futurilla in How to improve academic search, JURN tips and tricks, My general observations, Open Access publishing

Having recently got up close and personal with thousands of ejournal URLs, here are seven suggestions for those who are considering launching an independent open ejournal in the arts and humanities.

1.   Register your own domain name. Try to make it human-readable and meaningful — e.g.: www.fabric-artists.org rather than using initials or shortened forms such as www.f-art.com. Pay for the domain and all hosted server costs up-front, for at least ten years, with a reliable commercial web hosting provider. This should not cost you more than about £600. Expensive, but it means that the university IT techies can’t capriciously juggle the root URL and thus break all your inbound links. Store all parts of the journal at your domain, calling no core content in from off-site, or from “slightly-different” URLs.

Problems solved: a) countless dead “404” links in ejournal lists and directories just a few years old, and a circa 80% attrition rate on those more than five years old; b) a niche academic search-engine indexes your home page URL, but doesn’t also index the articles because you’ve stored them at a different URL.

2.   Consider using the URL and file name as a carrier for some basic metadata, including clearly indicating if the content is free or pay. For instance…

   www.technology-history.org/journal-issue-004/free-full-text/2009_adams_preindustrial_water_mills.html

Where preindustrial_water_mills are the first three words of the article title.

Without even accessing the document, a human can now glance at the URL in search results and read off:

   Journal name (Technology History)
   Issue number (Number 4)
   It’s from a journal
   It’s free full-text
   The year published (2009)
   The author surname (Adams)
   The first three words of the article title (“preindustrial water mills“)

As you can see, that’s much more useful than having something impenetrable such as:   www.hupt.stetford.edu/caij/admin/contentimages/38-02-106_h894.html   and far better than having a huge database-driven scripted URL. You’ll exclude common words such as ‘the’ from the article title, obviously.

Problems solved: a) a useful range of basic metadata is not automatically displayed alongside a link to the journal article, other than the title (if you’re lucky) and an often-misleading text snippet; b) users accessing via a standard public search-engine have to download and manually open your article file to find out simple things like when it was published and if it’s really free full-text.

3.   Don’t hang admin pages directly off the main URL. Put them in their own folder, e.g.: www.full-journal-name.org/editorial-files/our_editorial_board.html

Problem solved: Indexing the main domain also brings in all sorts of administrative fluff, old conference flyers, etc.

4.   Publish in HTML, as well as in PDF.

Problem solved: PDF is print-oriented (so consider linking each issue to a POD book publisher such as Lulu), but with HTML people can do more interesting things (like browser addons that auto-detect and auto-link citations on a page).

5.   Make sure all your articles contain basic information like: the journal title, issue number, and ideally your home-page URL in clickable form. Put this in the body text of the article. Also make sure your PDF file properties are all filled out correctly, as are your HTML headers. It’s just basic marketing really, but also useful for those who would organise knowledge.

Problem solved: A downloaded article from an open access ejournal very often has no embedded data giving the full journal title and issue number. Future generations won’t thank a researcher for telling them, “um yeah, but I once had that stuff via my personal copy of Zotero”.

6.   Zero tolerance for broken URLs and 404 errors. Never ever let your IT techies or web designers change your directory structure once it’s set. If they really have to for some world-shattering technical reason, then make sure you force them to set up durable (five-year minimum) working redirects for every article, or use some server magic to make the new structure look like the old structure to the outside world.

Problems solved: a) too many dead “404” links in ejournals directories just a few years old; b) blogs and discussion forums have many broken direct links to the journal articles they’re discussing; c) there are even sometimes broken links on the journal website itself(!) caused by directory-juggling.

7.   Publicise. There’s nothing more disheartening than doing a Google search for link:www.your-established-ejournal.org — and finding that the only people who link to it are your university and a lone blog post from 2006. Being a journal on an obscure topic doesn’t mean you should be invisible. Google will bury you if you don’t have any inbound links, and (I would imagine) your authors may drift away if no-one links to or reads their articles. There’s also a whole planet out there, and the next expert in hyperkinetic light-art might be a kid sitting in a bush college in Uganda. She needs to find your excellent new article giving an overview of hyperkinetic light-art.
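
Suggestion 2 can be sketched in a few lines of Python. The stop-word list, the function name, the invented article title and the choice to lower-case everything are assumptions for the sake of the example, not part of any standard.

```python
import re

# Small illustrative stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def article_filename(year: int, surname: str, title: str, words: int = 3) -> str:
    """Build a year_surname_first_title_words.html filename."""
    # Keep runs of ASCII letters and digits, lower-cased, then drop
    # stop words and keep only the first few remaining words.
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    keep = [t for t in tokens if t not in STOP_WORDS][:words]
    return "_".join([str(year), surname.lower(), *keep]) + ".html"


# Invented title, chosen to reproduce the example filename above.
print(article_filename(2009, "Adams", "The Preindustrial Water Mills of Wessex"))
```

Note that this simple pattern drops accented and non-Latin characters, so a journal with multilingual titles would need something more careful.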

FireCite

16 Tuesday Jun 2009

Posted by futurilla in How to improve academic search

Andy Hong already has a page for his 2009 undergraduate dissertation, titled “FireCite: A Browser Extension for Citation Recognition and Management” (2009). Not yet online, it seems, but there’s an abstract…

“This dissertation describes FireCite, a Mozilla Firefox extension that incorporates a citation parser and citation recognition. The citation parser is fast, lightweight, and can parse citations from HTML web pages with an overall F-measure of 0.878. Yet it can also parse plain-text citations with an overall F-measure of 0.97, comparable to larger and more complex parsers. The citation recognizer is also fast and lightweight with a high recall of 96%. FireCite proves that it is possible to perform citation recognition and parsing with real-time response and satisfactory accuracy. FireCite itself is able to recognise citations from any web page and extract basic metadata from them.”

Minutes of the development process + background papers | Latest version (0.501 with source code, 7th June 2009 — adds .ac to automatically processed domains)…

“As you surf, this extension detects citations on the webpage. You then have the option to save the information to a reading list, along with any attached PDF file.”

It seems to get confused (reads too many non-citations as if they were citations) on some types of pages, and it’s very basic. But it’s an interesting proof-of-concept for automatic finding/reading of citations on Web pages that are “in the wild” — compared to the popular Firefox addon Zotero which needs to find a “Zotero-friendly” website such as Google Scholar or Amazon in order to do something similar.

Common Tag and Search BOSS

13 Saturday Jun 2009

Posted by futurilla in Academic search, How to improve academic search

This looks somewhat interesting. Just launched, Common Tag…

“is an open tagging format developed to make [ Web ] content more connected, discoverable and engaging. Unlike free-text tags, Common Tags are references to unique, well-defined concepts, complete with metadata and their own URLs.”

From what I read, it sounds a bit like herding cats — attempting to persuade (firstly) bloggers and social bookmarkers to use standardised vocabularies and terminology for content tagging. I suspect it’ll find difficulties in gaining traction, simply due to the sheer size of the Web. Nice logo, though…

[Image: Common Tag logo]

It would be interesting to see an academic version, which could auto-read a document and suggest and automatically embed (microformat or RDFa?) tags using the A&AT terms.

And I just found out about the Yahoo Search BOSS, which seems to have been around in mature form since late 08. It’s Yahoo’s competitor to Google CSE. It seems to have appeared during their recent takeover troubles, which doesn’t inspire confidence. However, it’s getting new features and appears to be under active development. New sorting functions have apparently been added to BOSS, offering sorting by date and/or a specified time range (although it seems that may be limited to custom News search?). There’s also a Python-driven mashup feature, although at present people seem to be using this to add rather naff-looking context-aware sidebars alongside search-results. There’s also a kicker in the small print…

In the near future, we will be introducing a fee structure for BOSS

If sorting by date were a feature that could be added to Google CSE results, and a keyword-targeted RSS feed was then allowed to run from that sorting, JURN could feed you a usable approximation of a rolling keyword-specific table-of-contents alert from 3,000+ ejournals. Does the current standard open access ejournal publishing software allow that sort of cross-journal alerting service, I wonder?

Open access search?

12 Friday Jun 2009

Posted by futurilla in Academic search, How to improve academic search, My general observations

Following on from my previous post… a search for “open access” site:www.google.com/coop/ was discouraging. There are about twenty “living-dead” Custom Search Engines from 2006, but no large ones updated after 2006 (so far as I could tell from a quick visit).

Pouring out all this open access content is all very well, but where’s the competition and development in open access search?

And where are the simple common standards for flagging open content for search-engine discovery and sorting, for that matter? Judging by the structure and look of most academic repositories, internet search-engines are the last things on their minds.

Now of course I’m viewing things from the outside, as an independent curator and social entrepreneur, not a librarian or OA evangelist. But it seems to me that burying your PhD thesis deep in a repository cattle-car (seemingly with only a few keywords, an ugly template and an impenetrable URL for company) isn’t serving it or the author very well. Especially in terms of metadata and tagging leading to full-text search discovery. As the authors of “Experiences in Deploying Metadata Analysis Tools for Institutional Repositories” recently wrote in Cataloging & Classification Quarterly (No. 3/4, 2009)…

“Current institutional repository software provides few tools to help metadata librarians understand and analyse their collections.”

Which doesn’t bode well for search-engines aiming to hook into and sort the same metadata. That sort of statement might have been acceptable in 1999, but it’s a damning statement to hear from librarians in 2009. And another paper in the same issue concludes that there is…

“a pressing need for the building of a common data model that is interoperable across digital repositories”.

Now I wouldn’t know a Dublin Core from a Dublin Pint, but how difficult would it have been to build a search-engine friendly tag that allows a repository to tell the world “this is a root free-to-all full-text file” and “you’re not going to get any full-text for this title”? Or to allow the “one-click” filtering out of science and medical-related OA material across search results from a thousand repositories?

This could be done at the URL level. For example by using a standard universal URL structure that could be read by machines and humans alike. For a journal it might run something like:

   www.technology-history.org/journal-issue-004/free-full-text/2009_adams_preindustrial_water_mills.html

Where preindustrial_water_mills are the first three words of the article title.

Without even accessing the document, a human can now glance at the URL in search results and read off:

   Journal name (Technology History)
   Issue number (Number 4)
   It’s from a journal
   It’s free full-text
   The year published (2009)
   The author surname (Adams)
   The first three words of the article title (“preindustrial water mills“)
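
A machine can read off the same fields. The following Python sketch assumes the exact layout of the example URL above (domain, journal-issue-NNN, an access flag, then year_surname_title.html); the function and field names are hypothetical.

```python
from urllib.parse import urlsplit


def read_article_url(url: str) -> dict:
    """Split an example-style article URL into basic metadata fields."""
    # urlsplit needs a scheme to find the host, so add one if missing.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    # Expect exactly three path segments: issue folder, access flag, file.
    issue_dir, access, filename = parts.path.strip("/").split("/")
    year, surname, *title_words = filename.rsplit(".", 1)[0].split("_")
    return {
        "domain": parts.netloc,
        "issue": int(issue_dir.rsplit("-", 1)[1]),
        "free_full_text": access == "free-full-text",
        "year": int(year),
        "author": surname,
        "title_words": " ".join(title_words),
    }


example = ("www.technology-history.org/journal-issue-004/free-full-text/"
           "2009_adams_preindustrial_water_mills.html")
print(read_article_url(example))
```

A URL that does not follow the layout raises an error here, which is the point: the scheme only works if everyone sticks to it.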

For a repository it could look something like:

   www.uni.edu/oa-repository/free-full-text/theses/history/history-of-technology/2009_adams_preindustrial_water_mills.html

And with a uniform standard for URL structures, university IT techies would not be allowed to fiddle with the directory structure and thus break the URL. All full-text files in U.S. repositories could then be searched simply by indexing one line:

http://www.*.edu/oa-repository/free-full-text/
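
To make the one-line idea concrete, here is a small Python sketch that tests URLs against that pattern with the standard library's fnmatch module. The second candidate URL (with its "abstract-only" folder) is invented, and fnmatch's * is looser than a real search-engine wildcard would be, since it also matches dots and slashes. Treat this as an illustration of the principle only.

```python
from fnmatch import fnmatchcase

# The single index line from the text, with a trailing * added so
# that everything beneath the folder matches.
PATTERN = "http://www.*.edu/oa-repository/free-full-text/*"

candidates = [
    "http://www.uni.edu/oa-repository/free-full-text/theses/history/"
    "history-of-technology/2009_adams_preindustrial_water_mills.html",
    # Invented counter-example: same repository, but not free full-text.
    "http://www.uni.edu/oa-repository/abstract-only/theses/2009_smith.html",
]

for url in candidates:
    print(fnmatchcase(url, PATTERN), url)
```

The first URL matches and the second does not, so an indexer following the pattern would pick up only the free full-text files.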

Anyway, rant over. I did find a large Google CSE for Economics. Not much use for the arts and humanities you might think, and last updated in 2006, but due to its sheer size (23,613 sites from apparently reputable sources) searches for…

“creative economy” keyword

“creative industries” keyword

“art market” keyword

… all seem to show it still has some use as a discovery tool.

A sea of CSEs

12 Friday Jun 2009

Posted by futurilla in Academic search, How to improve academic search

I had a quick look around for other Google Custom Search Engines, via a simple search for:

keyword site:www.google.com/coop/

Living-dead CSEs from circa-2006 litter the results, of course. Probably made in 30 minutes during the first flush of public interest in Google’s new toy, usually indexing less than 30 items, and then seemingly forgotten about within 30 days.

I guess that’s one of the main reasons why people don’t seem to hold specialist Google CSEs in high regard. Which probably helps to explain why a search for 2009 site:www.google.com/coop/ seems to show that a mere 39 public CSEs have either been built or updated in the last six months. It seems a shame that the academic community is fiddling with often-unlovable and quickly-stale niche wikis, while such a powerful tool is all-but unused except for an occasional private one-site index. It’s not as if CSEs don’t have tools for collaborative index-building and weeding.

With a few months of careful work by a professional or subject-specialist, there’s no reason why a CSE can’t hold its head up alongside funded/commercial services, as I hope I’ve shown with JURN. And if a developer plans ahead and uses some common tools, basic maintenance of a large curated engine, once complete, shouldn’t take more than a couple of days of work per year.

I did find a few CSEs in the humanities still showing some stamina…

Theological journal search (340+ titles inc. findarticles.com, last updated Jan 2009).

Online Biblical Studies journals (123 titles, the titles freely listed, last updated 2008).

Judaic Studies in English (278 sites, last updated Sept 2007).

Alcuin Society (139 sites on bibliophilia and book arts, last updated Oct 2008).

AuseSearch (All open access academic repositories in Australia that are listed in Kennan & Kingsley at Feb 2009).

Film Blogs (139 titles, the titles freely listed, last updated June 2009. Looks like a strong tool for quickly finding genuine reviews from film-buffs, as opposed to marketing pseudo-reviews).

Busador Cultural (a large academic-cultural-arts search-engine for Spanish-language material).

So where might there be scope for a strong new curated CSE, with a nice balance of focus and scope? It might be useful to have an engine for “books still of scholarly worth, and other useful non-fiction” which selects from the ebooks that are flooding out from the out-of-copyright book digitisation projects, indexing the full-text. Books such as Tom Wedgwood, the first photographer and Kitecraft and Kite Tournaments. There has to be a more enticing way to access this stuff than getting your keywords tangled in creaky Victorian potboilers and agricultural pamphlets from 1932, or ploughing through a daily list seemingly endlessly populated by thousands of 1920s pulp novels and Victorian romances. But I’m willing to bet that there’s no flag in the metadata which says “non-fiction / just the cool stuff”, so it might take a lot of work.

Blind Search

11 Thursday Jun 2009

Posted by futurilla in Academic search, How to improve academic search, Spotted in the news

The academic blog Walt at Random tries out a new search tool, Blind Search…

“You type in a search. You get back the first 10 results for each of three search engines, displayed in three parallel columns. You click on one of three “vote for this search engine” buttons, based on the column of results that seem to match your query best. Then, and only then, Blind Search shows you the engine used for each column.”

Sure to be a fun ice-breaker in the hotel lobby at the First Conference on Open Access Scholarly Publishing, 14th – 16th Sept 09, Sweden.
