17 Monday May 2010
Posted in JURN's Google watch
Widely reported today — Google’s Marissa Mayer, vice president of search products and user experience, has stated that…
“next week we will start offering an encrypted version of Google Search”
19 Monday Apr 2010
How to export your personal SurfClarity list of websites (sites you’ve banned from appearing in Google search results). In the Firefox address bar type:
about:config
Scroll down to the line for extensions.surfclarity.patterns. Along that line you’ll find the list of URLs you have entered in SurfClarity. They can be copied out as plain text, and you can also modify them there.
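The same pref also lives in your Firefox profile’s prefs.js file, so the list can be pulled out with a few lines of script rather than by hand. A minimal sketch — the pref name is the one above, but the sample line and the pipe-separated value are purely hypothetical, so check your own prefs.js for the actual stored format:

```javascript
// Extract the string value of a user_pref from the text of prefs.js.
// Returns null if the pref is not present.
function extractPref(prefsText, name) {
  var escaped = name.replace(/\./g, '\\.');
  var re = new RegExp('user_pref\\("' + escaped + '",\\s*"((?:[^"\\\\]|\\\\.)*)"\\);');
  var m = prefsText.match(re);
  return m ? m[1] : null;
}

// Hypothetical prefs.js line -- the real stored format may differ:
var sample = 'user_pref("extensions.surfclarity.patterns", "example.com|spam.example");';
var patterns = extractPref(sample, 'extensions.surfclarity.patterns');
```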
30 Saturday Jan 2010
Posted in Academic search, JURN's Google watch, Spotted in the news
Oh dear, now arriving in academia — the dubious ‘art’ of optimising texts to temporarily rank highly in search-engines. Testament to the power of Google, I suppose. Optimizing Scholarly Literature for Google Scholar & Co. (PDF link).
29 Tuesday Dec 2009
Posted in JURN's Google watch
A small addition at Google Scholar. They’re now pulling out a right-hand identifier for the few free PDFs that appear in search results, to distinguish them from paywall PDFs…

Although it does fall down on muse.jhu.edu — which is effectively a paywall site, albeit a worthy one. And I did see one such link for emeraldinsight.com (although most emerald links didn’t have it), so the feature obviously still needs a little more tweaking.
Of course, if you use JURN then you can assume that everything in the results is free. Heh.
14 Monday Dec 2009
Devindra Hardawar at Pingdom cranes his neck out beyond the 10-second time-horizon of Planet Twit, and offers an informed view of where Google might be in ten years’ time. After Christmas, Google will partly be ranking sites on how speedily they respond, so it’s interesting to hear Devindra mention the new fast Google Public DNS service. The gist of his suggestions is:
* faster javascript;
* faster browsers;
* faster DNS;
* better HTML (version 5), leading to better, faster online applications;
* dirt-cheap or free internet access, subsidised by private companies;
* Android dominates mobile devices, leading to VOIP phones;
* Google at the speed-of-light (approaching 1/10th of a second, in search response time).
But, as always, the curation problem may remain fairly intractable…
“Their problem won’t be gathering all the data, it’ll be making sense of it […] it’ll be interesting to see how they tackle the rest of the upcoming deluge.”
Part of this problem is the general population’s lack of search skills, which leaves many people struggling to self-curate. Part of the solution might be for Google to offer a robust and beautifully-designed interactive search-skills online tutorial and test. It might be adaptive/morphing, to prevent cheating.
07 Monday Dec 2009
Posted in JURN's Google watch, Spotted in the news
There’s a newly released Firefox addon, called Google Custom Search 1.1.2 and made by Kai Londenberg. It creates independently-hosted anonymous Google CSEs, which you can manage and refine from your Google search results / browser. Although it uses the Google API, your engine’s data appears to be stored anonymously on a server in Europe…
“A Google Account is not required anymore, Custom Search Engines can be stored anonymously on quicksear.ch”
Basically, using this addon gives you a seamless melding of the normal Google results format with the major configuration possibilities of a CSE. It’s Google’s SearchWiki on steroids, in an exo-skeleton.
But I don’t see any way to backup your CSE’s XML annotations file of URLs, which means it would be rather risky to invest large amounts of time building a subject-specific CSE this way, rather than using Google’s own interface. Perhaps a backup option will appear once the quicksear.ch site goes live — the addon and service are currently very new, having seemingly been live since September.
There’s no way to upload a “big list o’ URLs” in the traditional manner and have them automatically boosted in the CSE’s search rankings. Your CSE is currently an “add one URL at a time” job, built as you surf the search results day in and day out. Which perhaps gives your CSE some interesting anti-spam/anti-SEO features, if it is to be used as a mass collaborative anonymous engine (which it apparently can be — tick “accept volunteer contributions” when creating your CSE). And it doesn’t seem to include Google Books results, even when you tell it to include them and boost their rating by 100%.
You currently lose Google’s new “Options…” sidebar, when searching via your quicksear.ch CSE addon (which appears along with the others, in Firefox’s top-right mini search box).
Just like the official Google CSEs, you get cut-and-paste HTML code, which lets others try out your CSE without needing to log in or install anything. I created a new experimental CSE titled JURN collaborative, with permissions for collaborators, but how collaborators contribute to it is currently a mystery.
Update: it seems that to collaborate you would have to share your quicksear.ch password with your collaborators.
06 Sunday Dec 2009
Posted in JURN tips and tricks, JURN's Google watch
In this simple tutorial I’ll show you how to rip a page of search result links into a .csv file, along with their link titles, using nothing more than Notepad and a simple bit of javascript.
(Update: January 2011. This tutorial is superseded by a new and better one.)
1) Have Google run your search in advanced mode, selecting “100 results on a page”. If you prefer Bing, choose Preferences / Results, and select “50 on a page”.
2) Run the search. Once you have your big page o’ results, just leave the page alone and save it locally — doing things like right-clicking on the links will trigger Google’s “url wrapping” behaviour on the clicked link, which you don’t want. So just save the page (In Firefox: File / Save Page As…), renaming it from search.html to something-more-memorable.html
3) Now open up your saved results page in your favourite web page editor, which will probably add some handy colour-coding to tags so you can see what you’re doing. But you can also just open it in Notepad, if that’s all you have available: right-click on the file and choose “Open with…”.
4) Locate the page header (it’s at the very top of the page, where the other scripts are), make some space in there, and then paste in this JavaScript…
A hat-tip to richarduie for the original script. I just hacked it a bit, so as to output the results in handy comma-delimited form.
5) Now locate the start of the BODY of your web page, and paste in this code after the body tag…
Save and exit.
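The two original snippets haven’t survived in this copy of the post, so what follows is a minimal reconstruction of the idea, not richarduie’s actual code. It builds numbered, comma-delimited rows (number, URL, anchor text — the column layout the later steps assume); the element id, textarea, and exact button wording are my assumptions:

```javascript
// Step 4 (page header): walk every <a> on the page and build CSV rows.
function csvQuote(s) {
  // Double any embedded quotes, then wrap the field in quotes.
  return '"' + String(s).replace(/"/g, '""') + '"';
}
function buildCsv(links) {
  // links: array of {href, text} -> one row per link: number,"url","anchor text"
  return links.map(function (l, i) {
    return (i + 1) + ',' + csvQuote(l.href) + ',' + csvQuote(l.text);
  }).join('\n');
}
function extractAllLinks() {
  var anchors = document.getElementsByTagName('a');
  var rows = [];
  for (var i = 0; i < anchors.length; i++) {
    rows.push({ href: anchors[i].href, text: anchors[i].textContent });
  }
  document.getElementById('csv-output').value = buildCsv(rows);
}

// Step 5 (just after the <body> tag) -- a button and a box for the output:
// <button onclick="extractAllLinks()">Extract all links and anchor titles as a CSV list</button>
// <textarea id="csv-output" rows="25" cols="120"></textarea>
```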
6) Now load up your modified page in your web browser (I’m using Firefox). You’ll see a new button marked “Extract all links and anchor titles as a CSV list”…
Press it, and you’ll get a comma-delimited list of all the links on the page, alongside all the anchor text (aka “link titles”), in this standard format…
Highlight and copy the whole list, and then paste it into a new Notepad document. Save it as a .csv file rather than a .txt file. You can do this by manually changing the file extension when saving a file from Notepad.
7) Now you have a normal .csv file that will open up in MS Excel, with all the database columns correctly and automatically filled (if you don’t own MS Office, the free OpenOffice Calc should work as an alternative). In Excel, highlight the third column (by clicking its top bar), then choose “Sort and Filter” and then “A-Z”…

You’ll then be asked if you want to “Expand the selection”. Agree to expansion (important!), and the column with the anchor text in it will be sorted A-Z. Expansion keeps all the columns in sync when one is re-sorted like this.
Now you can select and delete all the crufty links in the page that came from Google’s “Cached”, “Similar”, “Translate this page” links, etc. These links will all have the same name, so by listing A-Z we’ve made them easy to delete in one fell swoop.
8) You’re done, other than spending a few minutes ferreting out some more unwanted results. Feel free to paste in more such results from Bing, de-duplicate, etc.
If you wanted to re-create a web page of links from the data, delete the first column of numbers, and then save. Open up your saved .csv in Notepad. Now you can do some very simple search and replace operations, to change the list back into HTML…
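Those search-and-replace operations can equally be scripted. A small sketch, assuming (as above) that the leading number column has already been deleted, so each row is just "url","anchor text", with no embedded quotes in either field:

```javascript
// Convert one CSV row of the form "url","anchor text" into an HTML link.
// Returns null for rows that don't match (e.g. blank lines).
function rowToLink(row) {
  var m = row.match(/^"([^"]*)","([^"]*)"$/);
  return m ? '<a href="' + m[1] + '">' + m[2] + '</a>' : null;
}

// Apply to the whole saved .csv at once:
function csvToHtml(csvText) {
  return csvText.split('\n').map(rowToLink).filter(Boolean).join('\n');
}
```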

(Note: you can also use the excellent £20 Sobelsoft Excel Add Data, Text & Characters To All Cells add-in for complex search & replace operations in Excel)
Ideally there would be free Firefox Greasemonkey scripts, simple freeware utilities, etc, that could do all of this automatically. But, believe me, I’ve looked and there aren’t. Shareware Windows URL extractors are ten-a-penny (don’t waste good money on them, use the free URL Extractor), but not one of them also extracts the anchor text and saves the output as .csv.
Yes, I do know there’s the free Firefox addon Outwit Hub, which via its Data / Lists … option can capture URLs and anchors — but it jumbles everything in the link together, anchor text, snippet, Google gunk, etc, and so the link text requires major cleaning and editing for every link. Even with the hit-and-miss home-brew scraping filters, it’s not a reliable solution.
02 Wednesday Dec 2009
Posted in JURN's Google watch
Oh, this is interesting. filetype:pdf is now working in Google Scholar. It used to be ignored. Using it seems to filter out citation-only records. Results are still cluttered with paywall Springer / Oxford / Sage / Muse etc results — those services will happily send a PDF which will always fail to open on a home connection, presumably due to encryption — but the results are noticeably different and give a better chance of obtaining full-text articles.
03 Tuesday Nov 2009
Posted in JURN's Google watch
Google is now supporting the filetype:epub search modifier, for finding ebooks in the popular epub format. Google has fairly limited coverage of such files so far, at a reported 54,000 hits. Searches for Iliad and Wonderland only show a few epub editions of each.
31 Saturday Oct 2009
Posted in JURN's Google watch
Google’s Custom Search blog summarises design-oriented changes to the Google Custom Search service. Along with themes for formatting search results, one of the most interesting changes is auto-reformatting of results when Google detects you’re on a mobile device.