How to bypass the Great Censorship Wall of Australia…
“The brief? To teach your grandma to bypass the Internet filter.”
21 Friday May 2010
19 Monday Apr 2010
How to export your personal SurfClarity list of websites (sites you’ve banned from appearing in Google search results). In the Firefox address bar type:
about:config
Scroll down to the line for: extensions.surfclarity.patterns. Along that line you'll find the list of URLs you have entered in SurfClarity. They can be copied out as plain text. You can also modify them.
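These about:config values also live in the prefs.js file in your Firefox profile folder, so the list can be pulled out with a small script. A rough Python sketch, assuming the pref is stored as a single comma-separated string (the sample line and value format here are invented for illustration):

```python
import re

# Hypothetical sample of prefs.js content from a Firefox profile folder:
sample_prefs = '''
user_pref("browser.startup.homepage", "about:blank");
user_pref("extensions.surfclarity.patterns", "example.com,spamsite.net,junk.org");
'''

def extract_patterns(prefs_text):
    """Pull the SurfClarity URL list out of prefs.js text."""
    match = re.search(
        r'user_pref\("extensions\.surfclarity\.patterns",\s*"([^"]*)"\)',
        prefs_text)
    return match.group(1).split(",") if match else []

patterns = extract_patterns(sample_prefs)
print(patterns)
```

If the extension stores its patterns in some other delimited form, only the final split would need changing.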
14 Monday Dec 2009
Devindra Hardawar at Pingdom cranes his neck out from the 10-second time-horizon of Planet Twit, and offers an informed view on where Google might be in ten years' time. After Christmas, Google will partly be ranking on how speedily a site responds, so it's interesting to hear Devindra mention the new fast Google Public DNS service. The gist of his suggestions is:
* faster javascript;
* faster browsers;
* faster DNS;
* better HTML (version 5) leading to better faster online applications;
* dirt-cheap or free internet access, subsidised by private companies;
* Android dominates mobile devices, leading to VOIP phones;
* Google at the speed-of-light (approaching 1/10th of a second, in search response time).
But, as always, the curation problem may remain fairly intractable…
“Their problem won’t be gathering all the data, it’ll be making sense of it […] it’ll be interesting to see how they tackle the rest of the upcoming deluge.”
Part of this problem is the lack of search skills among the general population. Many people have a hard time self-curating, partly because of problems with search skills. Part of the solution might be for Google to offer a robust and beautifully-designed interactive search-skills online tutorial and test. It might be adaptive/morphing, to prevent cheating.
09 Wednesday Dec 2009
Two new December 2009 reports from the UK’s prolific RIN, part of a cluster of five such reports…
1) Overcoming barriers: access to research information (PDF link)…
“This report finds that many researchers are encountering difficulties in getting access to the content they need and that this is having a significant impact on their research.”
“technical limitations such as log in/authentication problems (26%) or problems with proxy servers and off-site access (a particular problem for researchers [seeking to access ejournals] – a majority in the humanities and social sciences – who spend significant amounts of time away from their home institution)”
“The proportions of those who felt the impact [of unavailable ejournal content] as having a ‘significant’ impact on their research were higher in the arts and humanities“
2) How researchers secure access to licensed content not immediately available to them (DOC link, Word)…
“emailing the author directly […] creative searching online, primarily using Google Books and Google Scholar […] accessing cached content; and signing up for free trials with publishers […] buying books online, usually second-hand, when they are unable to get access via other routes.”
[ Hat-tip: Open Access News ]
08 Tuesday Dec 2009
In an age of 24″ widescreen monitors, why do many people stick with a long scrolling page format for search results — more suited to the age of the accounting ledger?

When the results could look like this…

How? Here’s my recipe:
The Firefox web browser, with the GreaseMonkey addon. Then add the Google 100 GreaseMonkey script, and set it to show 24 results per search page. Add the GoogleMonkeyR script, and set it up to show three columns (and to remove clutter such as “Related searches” and “Sponsored Links”).
You’ll never scroll on search-results again.
I’m assuming you already have AdBlock Plus installed on Firefox, to remove all Google text ads. The GreaseMonkey script New Google Ad-block may also be of interest, to block the page-integrated ads that Google is now adding to results.
08 Tuesday Dec 2009
A new Firefox + Greasemonkey script: Google Scholar H-Index…
“This rough, yet useful, Firefox GreaseMonkey script will enable you to automatically display some of the most known citation indices (h-index, g-index, e-index) for any author queried on Google Scholar. […] The script currently processes just the displayed result page, and, as such, does not currently work for persons having enormous (h or g)-index (h or g > 100).”
I have to admit I’m not entirely sure how such measures work. But I assume that ‘more is better’ in terms of the starting citations needed to take a measurement. So possibly someone will hack it so that it works through 1000 search results, rather than the current 100?
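For what it's worth, the two main measures are simple to compute once you have a list of per-paper citation counts. A minimal Python sketch (the citation counts are invented for illustration):

```python
def h_index(citations):
    """h = the largest h such that h papers each have at least h citations."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cites, start=1):
        if c >= rank:
            h = rank
    return h

def g_index(citations):
    """g = the largest g such that the top g papers together have >= g*g citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

example = [25, 8, 5, 3, 3, 1]   # invented per-paper citation counts
print(h_index(example))  # 3
print(g_index(example))  # 6
```

This also makes the script's stated limitation concrete: an h-index over 100 needs more than 100 ranked papers to measure, hence the need to work through more than the first page of results.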
Another new script of interest is Google Scholar Citation Explorer…
“An enhancement for Google Scholar that lets you see which citations a set of papers have in common. Select a group of related papers, even from across searches, and see which papers cite the whole set (or a subset of it).”
07 Monday Dec 2009
This is rather nice, and seems to have been released in the last few days. A new Chinese Language translation add-on for Firefox, where the language of the web page is auto-detected and the translation happens seamlessly within the existing page layout. There’s no messing around with tedious right-clicking, highlighting, hovering over buttons, etc. This is one of the first of many such add-ons, I would hope. Future browsers should have this built in, for all the major languages.
The only problem at present is that it's rather too seamless. Users need a little visual flag to show when it's been applied to a page. And perhaps a "toggle" button.
03 Thursday Dec 2009
1) Google often doesn’t seem to index quite everything at a site. Nor does it always index everything on a page or in a PDF file. Or perhaps it does index everything, but the algorithm that shapes each set of search results jettisons a few results for various reasons? The other possibility is that Google’s results are drawn from a pool of ‘shards’ of previous results, rather than direct from the core crawl data.
Solution: Google “Caffeine” and subsequent revamps?
2) Results from the main Google search can sometimes differ from those in your CSE. Your CSE will occasionally return radically fewer results from a site than the main Google does. Google doesn't explain why this is, or the mechanism behind it. Perhaps there are several different versions of the Google index. Results are often much better when using a more sophisticated search method than simple keywords: searching "for phrases", for instance. Sometimes you have to give up on trying to get your CSE to "see" the PDFs you want (although these are visible to the main Google) — and instead find a way to index just the linked table-of-contents pages (which will usually show up in your CSE).
Solution: A lot of extra work. Google could offer a “full Google” CSE to worthy non-profits.
3) Academics love to store their real content at a location with a different URL from their home page. An unoptimised CSE may thus index a website containing ten pages, but not the 10,000 articles that those pages point to.
Solution: A lot of extra work, of the sort that JURN has undertaken, to find and then optimise the real “content location URL”.
4) Initial URL gathering can be arduous. Techies and web editorial staff at universities love to juggle directory structures, often for no discernible reason, and thus break links. Link-rot is severe in ejournal lists from more than two years ago, and lists over four years old often have around 80% dead links.
Solution: Techies need to set up robust redirects if they really have to break URLs. “Self-destruct tags” that delete a links-list page after a certain date, if it hasn’t been updated for more than two years.
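Short of running a full link-checker, even a crude audit helps: flag any list entry not verified within the last two years for re-checking, in the spirit of the "self-destruct tag" idea. A sketch in Python, with invented URLs and verification dates:

```python
from datetime import date

# Hypothetical records: (url, date the link was last verified as live).
links = [
    ("http://example.edu/ejournals/history.html", date(2009, 10, 1)),
    ("http://example.edu/old/serials.html",       date(2005, 3, 15)),
    ("http://example.edu/philosophy/toc.html",    date(2007, 6, 30)),
]

def stale(records, today, max_age_days=730):
    """Return URLs not verified within roughly two years of `today`."""
    return [url for url, checked in records
            if (today - checked).days > max_age_days]

print(stale(links, date(2009, 12, 3)))
```

Anything the function flags could then be queued for an actual HTTP check, or simply dropped from the list.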
5) Google CSEs cannot pick specific content (e.g.: a run of journal issues) from the meaningless database-driven URLs commonly found in academic repositories, since there is no repeating URL structure to grab onto. It’s a question of indexing “all or nothing”.
Solution: URL re-mapping services that are recognised and can be “unwrapped” by Google? Plain HTML “overlay” TOCs.
6) Editors don’t enforce proper file-names on published documents, which means many CSE search results are titled in the Google results as something like “&63! print_only sh4d7gh.indd” rather than “My Useful Title”. Nor do people add the home location URL and website title to the body of their document — which means that scholars can waste several minutes per article trying to find out where it came from. Some students may never manage to find the journal title for the article they downloaded.
Solution: Better publication standards at open access and independent ejournals.
7) Large Google CSEs are easy to make, but take a lot of hand-crafting to properly optimise and maintain. "Dead" CSEs from late 2006, when the CSE service first appeared, litter the web. Most of these were also un-optimised. Despite the potential of CSEs, it's really hard to find large subject-specific CSEs that are both optimised and maintained. Most people now seem to use CSEs for indexing a single site, or a small cluster of sites that they own.
Solution: Users should remove old circa-2006 CSEs from the web. Subject-specific academic and business groups should consider building a collaborative CSE rather than a wiki.
8) Google's search result ranking doesn't work as well as it might in tightly defined academic searches. PageRank tends to "spread the results" evenly across a variety of sites, and thus you'll rarely see results from just one site dominating the first ten hits – although that may be exactly what a tight academic search requires.
Solution: For some types of CSE, this could probably be solved by delving into the optimisation features that Google offers for linked CSEs. Update: Google appears to have tweaked the algorithms to fix this problem.
9) Google searches have a problem with finding text at the end of long article titles, of the kind which are common in academia.
Solution: Authors and publishers should work to keep article and page titles under 50 characters.
10) You can’t have your CSE do a “search within search results”.
Solution: Manually build a set of pages containing the result URLs you want indexed, then get Google to see these as static pages which can then be added to your CSE.
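Building such a static page can be automated. A minimal Python sketch that turns a list of result URLs (invented here) into a plain HTML links page, which Google can crawl and which can then be added to a CSE:

```python
def make_overlay_page(urls, title="Curated search results"):
    """Render a plain HTML list of links, one per result URL."""
    items = "\n".join(
        '    <li><a href="{0}">{0}</a></li>'.format(u) for u in urls)
    return ("<html><head><title>{0}</title></head>\n"
            "<body>\n  <ul>\n{1}\n  </ul>\n</body>\n</html>").format(title, items)

# Hypothetical result URLs you want your CSE to "see":
results = [
    "http://example.edu/ejournal/vol1.pdf",
    "http://example.org/papers/2009/intro.html",
]
page = make_overlay_page(results)
print(page)
```

The output would be saved as a static .html file, hosted somewhere crawlable, and that page's URL added to the CSE.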
02 Wednesday Dec 2009
Tenurometer is a Firefox addon that works with Google Scholar…
“to facilitate citation analysis and help evaluate the impact of an author’s publications.”


Sadly, the makers of the addon are dangerously wrong in writing that…
“Google Scholar provides excellent coverage”
Scholar provides only very marginal coverage of several thousand independent and open access titles in the arts and humanities. Another problem might arise from the fact that it also indexes repositories and home-pages, as well as journals. Further problems with using Google Scholar for assessing impact have been discussed elsewhere by others.
One other thing that goes unexplained is how to access Tenurometer once you’ve installed it. It’s an addon that’s counter-intuitively accessed under the “View” menu rather than “Tools”/Add-ons. To turn it on you need to go to…

Then you get…

You need to type “p” to get a drop-down predefined list of subject tags.
At the moment, it’s painfully slow — taking over a minute to process a simple History subject area query for author Klaus Graf. Finally, after six erroneous pages of medical papers Tenurometer offered a correct link to: “Reich und Land in der sudwestdeutschen Historiographie um 1500”. The “filter results by subject area” option still needs some heavy work, it seems.
30 Monday Nov 2009
Do we need a new Google CSE for academic repositories? The old ones are looking rather long in the tooth, and their link-rot must be getting pretty bad by now.
Open DOAR search, according to the date on the foot of the search page, has not been updated since Nov 2006. Similarly, ROAR‘s own Google Custom Search Engine has not been updated since Nov 2006.
I think it's time for a new and up-to-date one. It shouldn't be difficult to extract the URLs from a downloaded set of OpenDOAR country pages, which are still actively maintained. It's even easier to download the .csv of all the URLs from ROAR and to extract them with Excel. As with OpenDOAR, it seems that the ROAR repository list is up-to-date, even if the CSE isn't. One would then combine the lists, de-duplicate and clean them, and upload the cleaned list to a sparkly new Google Custom Search Engine. If I had the space to add another 2,000 URLs to my Google CSEs, I'd do it myself.
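The combine-and-clean step is mechanical enough to script rather than do in Excel. A rough Python sketch, using invented repository URLs; real OpenDOAR/ROAR exports would doubtless need extra handling for malformed entries:

```python
def clean(urls):
    """Normalise a combined URL list and drop duplicates, keeping order."""
    seen, out = set(), []
    for u in urls:
        u = u.strip().rstrip("/")          # trim whitespace and trailing slash
        if not u:
            continue                       # skip blank rows from the exports
        if not u.startswith(("http://", "https://")):
            u = "http://" + u              # bare hostnames get a scheme
        key = u.lower()                    # case-insensitive de-duplication
        if key not in seen:
            seen.add(key)
            out.append(u)
    return out

# Invented example rows, as might come from the two downloads:
opendoar = ["http://eprints.example.ac.uk/", "repository.example.edu"]
roar = ["http://eprints.example.ac.uk", "http://archive.example.org/"]
print(clean(opendoar + roar))
```

The cleaned list could then be pasted straight into the CSE control panel's bulk "sites to search" box.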