Competition for the Google CSE

04 Tuesday Jan 2011

Posted by futurilla in How to improve academic search, JURN's Google watch, Spotted in the news

IndexTank, custom search in a box. Nice idea. But it seems to be aimed at individual businesses looking to reduce their IT overheads, and is useless as a replacement for a Web-wide Google CSE…

“IndexTank doesn’t actively fetch data from you as a web crawler would do. Instead, your application sends IndexTank the data as soon as it is created or updated”

“not a standalone web search engine, and we don’t currently have a way for you to set it up directly through the Web. It requires downloading software such as a WordPress plugin (if you wanted to add better search to your blog, for example) or writing a program to interact with our servers.”

Worse, it can’t even auto-extract indexable text from the PDFs you send it…

“IndexTank, like other full-text search alternatives, indexes only text. However, for common formats like PDF or Word, it is very easy to parse them to obtain the readable text by using open source tools.”

I should mention some of the other ‘sort-of’ search-in-a-box options.

* The old and vulnerable (in the light of the Delicious closure) Yahoo BOSS

* Spinn3r. But it can only supply “A-list” blog content (so possibly not much use for hyperlocal indexing of a city-region), and you have to build your own widget to hook into its API.

* 80 Legs is a pricey monthly-subscription web-crawler. I’m uncertain if their stated ‘URL limit’ refers to the number of URLs on the originating site-list, or the number of files actually found by their crawler. If it’s the latter, you could run out of space very fast.

* And of course the new Blekko, which lets you upload a text file full of your selected URLs, and then uses them to create a ‘slashtag’ that delimits people’s searches. The last one is interesting, and I might eventually have a play around with it. Although possibly that’ll be when you’re no longer limited to 1,000 URLs, and are allowed to use wildcards in the URL list.

It’s great to see some competition for Google CSEs emerging, and perhaps it will eventually spur Google into offering a commercial ‘Deep’ Web-wide version of the Custom Search Engine: full-text deep indexing of all the documents found at any website it’s pointed at; all the documents found are drawn on to produce your custom search results, every time; and the user gets 12,000 URLs to play with. Or perhaps Microsoft Bing will offer such a service. It might be limited to non-profits, so as to keep the SEO spivs out.

Ropey repositories

30 Tuesday Nov 2010

Posted by futurilla in How to improve academic search, My general observations

It’s always been annoying that academic repositories jumble together paywall / no-access / open access material, and don’t allow users to search only for open access + full-text materials. With a very few honourable exceptions, it’s a ridiculous situation — and the so-called library professionals involved in the development of such ‘standards’ should be hanging their heads in shame. Bibliographic Wilderness agrees…

“Really, I’m deeply disappointed that this kind of thing — good metadata that will allow software to know if an item really is OA, and to get a link directly to the content as well as the landing page — doesn’t seem to be a concern of the repository communities. This has been a problem for YEARS, and if any of the various organizations involved in this stuff are even making any efforts to address it, I haven’t heard about it.”

Full text vs. abstracts

27 Saturday Nov 2010

Posted by futurilla in How to improve academic search, Spotted in the news

Jimmy Lin’s “Is searching full text more effective than searching abstracts?”. Conclusion…

“Users searching full text are more likely to find relevant articles than searching only abstracts.”

How to import/export a list of banned URLs from the Google Noise Reduction script, for Firefox + GreaseMonkey.

20 Saturday Nov 2010

Posted by futurilla in How to improve academic search, JURN tips and tricks, JURN's Google watch

You may have spent some time building up a list of banned URLs for the Firefox addon Surfclarity, which strips unwanted domains from Google search results. Surfclarity no longer works with the latest Google changes, but the Greasemonkey script Google Noise Reduction does. In this tutorial we’ll swap the Surfclarity blacklist into the Google Noise Reduction blacklist.

1. In Firefox’s address bar, type: about:config.

2. Scroll down to extensions.surfclarity.patterns

Double click on the line of banned URLs you’ll find there, and copy them to Notepad.

3. Scroll further down to greasemonkey.scriptvals.http://exego.net//Google Noise Reduction.blacklist and take a look at the format. Note that it’s a little different than Surfclarity…

({'britannia.com':true, 'oxfordjournals.org':true, 'tandf.co.uk':true, 'ingentaconnect.com':true, 'sagepub.com':true, 'myspace.com':true, 'experts-exchange.com':true})

So we’re going to have to do some basic search-and-replace on our Surfclarity blacklist. Back up the Google Noise Reduction.blacklist if you want, as we’re going to overwrite it in a few moments.

4. Go back to Notepad and look at the list of Surfclarity URLs you just copied out.

Search for : and replace with : ' (that is, a colon, a space, then a single straight quote). Note the space after the ":".

Then search for : again and replace it with ':true,

Now add ({' to the very start of this list, and ':true}) to the very end of this list.

Congratulations, you now have your Surfclarity list in Google Noise Reduction format.

5. Copy your new list to the clipboard, go back to greasemonkey.scriptvals.http://exego.net//Google Noise Reduction.blacklist, clear what’s in there at the moment, and then paste the new list in. You’re done.

Obviously, you can now also keep a backup copy of the Google Noise Reduction.blacklist value by copying it out to a text file in the same way.
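If you'd rather not do the search-and-replace by hand, the conversion in steps 4 and 5 can be sketched in a few lines of Python. This is only a rough sketch: it assumes the Surfclarity list is a single line of domains separated by colons (which is what the search-and-replace steps imply), and the function names surfclarity_to_gnr and gnr_to_domains are made up for this example rather than part of either tool.

```python
import re


def surfclarity_to_gnr(patterns, sep=":"):
    """Turn a Surfclarity-style list into Google Noise Reduction format.

    Assumption: domains are separated by `sep` (a colon by default).
    """
    # Split on the separator and drop empty or whitespace-only entries.
    domains = [d.strip() for d in patterns.split(sep) if d.strip()]
    # Build 'domain':true pairs and wrap them in ({ ... }).
    return "({" + ", ".join("'%s':true" % d for d in domains) + "})"


def gnr_to_domains(blacklist):
    """Pull the plain domain names back out of a blacklist value,
    e.g. to export them to a text file as a backup."""
    return re.findall(r"'([^']+)'\s*:\s*true", blacklist)


if __name__ == "__main__":
    converted = surfclarity_to_gnr("britannia.com:myspace.com")
    print(converted)
    print("\n".join(gnr_to_domains(converted)))
```

Paste the first printed line into the blacklist preference exactly as in step 5. The second function gives you a plain one-domain-per-line list for safe keeping.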

Carrot 2

16 Saturday Oct 2010

Posted by futurilla in How to improve academic search

Carrot2 is open source software for finding thematic clusters in groups of documents…

“It can automatically organize [and label] small collections of documents, e.g. search results, into thematic categories. Apart from two specialized document clustering algorithms, Carrot2 offers ready-to-use components for fetching search results from various sources including YahooAPI, GoogleAPI, Bing API, eTools Meta Search, Lucene, SOLR, Google Desktop and more.”

“Carrot2 came about as a framework for building search-results clustering engines but its algorithms should successfully cluster up to about a thousand text documents, a few paragraphs each”

Scholarly Publishing through Open Access: A Bibliography

10 Sunday Oct 2010

Posted by futurilla in Economics of Open Access, How to improve academic search, Official and think-tank reports, Open Access publishing

A comprehensive new 2010 bibliography, Transforming Scholarly Publishing through Open Access: A Bibliography.

“…has over 1,100 references, provides in-depth coverage of published journal articles, books, and other works about the open access movement. Many references have links to freely available copies of included works.”

Association of Learned and Professional Society Publishers – 2010 proceedings

23 Thursday Sep 2010

Posted by futurilla in How to improve academic search

A set of podcasts and PowerPoint slides from the Sept 2010 conference of the Association of Learned and Professional Society Publishers.

Including:

* The Seven Crises of Scholarly Publishing : extinction or evolution?
* The Library : the best place for information research?
* Needles in a Virtual Haystack : discoverability as a route to market

ScholarLynk

08 Wednesday Sep 2010

Posted by futurilla in How to improve academic search

Details of a new prototype tool from Microsoft Research: ScholarLynk…

“ScholarLynk is a desktop solution aiming to support researchers in building and maintaining ‘reading lists’ of resources in collaboration with other researchers […] tools for (i) constructing reading lists by tagging the desired resources, (ii) seamlessly incorporating remote data sources as desktop resources, and (iii) supporting in-context communication, sharing of reading lists, and collaboration with other users of the ScholarLynk.

The prototype implementation leverages the DRIVER Infrastructure for European Open Access [repository] publications that currently comprises 2,500,000 publication records from over 250 repositories world wide.”