Moving from Google Noise Reduction to Google Hit Hider

22 Tuesday Mar 2011

Posted by futurilla in JURN tips and tricks, JURN's Google watch

Firefox 4 final is out now. Sadly it breaks the Greasemonkey script Google Noise Reduction, which was an excellent per-domain results blocker for Google Search.

However, the new and powerful Google Hit Hider does work very well, and is very similar. It’s obviously learned a lot from earlier software like Blocksite, Surfclarity, and Noise Reduction (all of which no longer work with FF4 / the latest Google), and there are some nice refinements, not the least of which is very easy import/export as simple plain-text lists of URLs.

It’s a fairly simple process to get your hand-crafted Noise Reduction blocklist out of Firefox and into Google Hit Hider…

1. In Firefox’s address bar, type: about:config

2. Scroll down to greasemonkey.scriptvals.http://exego.net//Google Noise Reduction.blacklist, where you’ll see…

({‘britannia.com’:true, ‘oxfordjournals.org’:true, ‘tandf.co.uk’:true, ‘ingentaconnect.com’:true, ‘sagepub.com’:true, ‘myspace.com’:true, ‘experts-exchange.com’:true})

3. Double click on the line of banned URLs you’ll find there, and copy them to Notepad.

4. Now top-and-tail the list, removing the ({ at the start and the }) at the end. Then search and replace to strip out the quote marks and each :true, until you have a clean list with each URL separated by a single comma. Save the list as a .csv (comma-separated values) file, then open it in Microsoft Excel (or Calc, the free OpenOffice equivalent). The list should load with one URL per cell.

5. Now just copy and paste the resulting cleaned list into: Manage Hiding / List Util / ‘Perma-ban list’ in Google Hit Hider.
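If you’d rather not do the search-and-replace by hand, the clean-up in steps 3 and 4 can be sketched in a few lines of Python. This is only a rough sketch, not part of either script: the function names are made up for illustration, it assumes the value copied from about:config looks like the example above, and it assumes that a one-URL-per-line list is acceptable to paste into Hit Hider. Both straight and curly quote marks are accepted, since blog software tends to convert one into the other.

```python
import re

# Quote characters that may wrap each domain: straight single/double quotes,
# plus the curly single quotes that blog software often substitutes.
QUOTES = "'\"\u2018\u2019"

# A quoted domain followed by ":true" (whitespace around the colon is tolerated).
ENTRY = re.compile(r"[{q}]([^{q}]+)[{q}]\s*:\s*true".format(q=QUOTES))


def noise_reduction_to_domains(pref_value):
    """Return the domains found in a Noise Reduction blacklist value, in order."""
    return ENTRY.findall(pref_value)


def to_plain_list(domains):
    """Join domains one per line (an assumed format for pasting elsewhere)."""
    return "\n".join(domains)


if __name__ == "__main__":
    copied = "({'britannia.com':true, 'oxfordjournals.org':true, 'tandf.co.uk':true})"
    print(to_plain_list(noise_reduction_to_domains(copied)))
```

Paste your own copied value in place of the sample string, run it, and copy the printed list.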

The advantage of this over the now-native Google blocking is that: i) it lets you break the 500-URL limit; ii) you can block domains en masse rather than one at a time; and iii) it lets you easily import/export the blocklist, in order to share with colleagues etc.

Google launches Google Art

01 Tuesday Feb 2011

Posted by futurilla in JURN's Google watch, Spotted in the news

A new service from Google, Google Art Project…

“Explore museums from around the world, discover and view hundreds of artworks at incredible zoom levels, and even create and share your own collection of masterpieces.”

Based on the Google Maps technology and its familiar interface, the images are gigapixel and presented without watermarks. Just 17 gigapixel images to start with, and there are also StreetView-like tours of their museums. If images look a little blurry as you zoom in, then simply give time for the tiles to load (in a similar way to Google Earth), and the sharper tiles should appear.

The “Add” button for the creation of personal collections doesn’t seem to work in Firefox.

New Google search modifier

20 Thursday Jan 2011

Posted by futurilla in JURN's Google watch

Interesting new Google search modifier…

inblogtitle:keyword

The rising tide of Web spam

06 Thursday Jan 2011

Posted by futurilla in JURN's Google watch, My general observations

A big dollop of lazy journo-bluster has landed at The Guardian, over the amount of outright spam that’s been inveigling itself into the Google search-results.

This growing so-called backlash is largely down to some users thinking they can still type in “dishwasher review” and get good results. Those “two keywords is enough” days are over: just spend 50 minutes learning how to search properly, guys. Yet some people are going to find learning this more difficult than others, as more and more people who are not fully literate are now trying to use the web. They can’t skim-read the results very well, or remember how to do complex strings of search modifiers. The ‘advanced search’ forms scare them. All the more reason why we need to be teaching search literacy from infant school onward.

Perhaps the Googleplexers who do nothing else but weed for spam are being temporarily overwhelmed? There’s an obvious tidal wave of robot-registered domains being populated by robots with robot-made pages. 99% of this Web spam has never seen a human hand, other than in the plagiarised material that gets pirated, semi-garbled, and pasted into the page. So, hire as many people as it takes to rip out the spam. It’s not as though Google doesn’t have the cash to throw another 500 eyeballs at the problem.

The other problem that people seem to be raising in the Guardian comments is that we don’t really have a reliable hand-made search-engine for product reviews, one that is devoted to serving only reliable reviews from reliable sources — and nothing else. Certainly, I’ve never found one I like and feel I can trust, and which is comprehensive in its sources and relevant to the UK.

Competition for the Google CSE

04 Tuesday Jan 2011

Posted by futurilla in How to improve academic search, JURN's Google watch, Spotted in the news

IndexTank, custom search in a box. Nice idea. But it seems to be aimed at individual businesses looking to reduce their IT overheads, and is useless as a replacement for a Web-wide Google CSE…

“IndexTank doesn’t actively fetch data from you as a web crawler would do. Instead, your application sends IndexTank the data as soon as it is created or updated”

“not a standalone web search engine, and we don’t currently have a way for you to set it up directly through the Web. It requires downloading software such as a WordPress plugin (if you wanted to add better search to your blog, for example) or writing a program to interact with our servers.”

Worse, it can’t even auto-extract indexable text from the PDFs you send it…

“IndexTank, like other full-text search alternatives, indexes only text. However, for common formats like PDF or Word, it is very easy to parse them to obtain the readable text by using open source tools.”

I should mention some of the other ‘sort-of’ search-in-a-box options.

* The old and vulnerable (in the light of the Delicious closure) Yahoo BOSS

* Spinn3r. But it can only supply “A-list” blog content (so possibly not much use for hyperlocal indexing of a city-region), and you have to build your own widget to hook into its API.

* 80 Legs is a pricey monthly-subscription web-crawler. I’m uncertain if their stated ‘URL limit’ refers to the number of URLs on the originating site-list, or the number of files actually found by their crawler. If it’s the latter, you could run out of space very fast.

* And of course the new Blekko, which lets you upload a text file full of your selected URLs, and then uses them to create a ‘slashtag’ that delimits people’s searches. The last one is interesting, and I might eventually have a play around with it. Although possibly that’ll be when you’re no longer limited to 1,000 URLs, and are allowed to use wildcards in the URL list.

It’s great to see some competition emerging to Google CSEs, and perhaps it will eventually spur Google into offering a commercial ‘Deep’ Web-wide version of the Custom Search Engine:— full-text deep indexing of all the documents found at any website it’s pointed at; all the documents found are drawn on to produce your custom search results, every time; and the user gets 12,000 URLs to play with. Or perhaps Microsoft Bing will offer such a service. It might be limited to non-profits, so as to keep the SEO spivs out.

Spamming Google Scholar

22 Wednesday Dec 2010

Posted by futurilla in JURN's Google watch, Ooops!

Spamming Google Scholar. Very possible, or so it seems…

“…we conducted several tests on Google Scholar. The results show that academic search engine spam is indeed – and with little effort – possible: We increased rankings of academic articles on Google Scholar by manipulating their citation counts; Google Scholar indexed invisible text we added to some articles, making papers appear for keyword searches the articles were not relevant for; Google Scholar indexed some nonsensical articles we randomly created with the paper generator SciGen; and Google Scholar linked to manipulated versions of research papers that contained a Viagra advertisement.”

Beel, J. (2010)
Academic Search Engine Spam and Google Scholar’s Resilience Against it.
Journal of Electronic Publishing 13 (3), December 2010.

AROUND Google

15 Wednesday Dec 2010

Posted by futurilla in JURN's Google watch

A new Google search modifier… AROUND.

apples AROUND(3) pears

…gives results that contain the word “apples” within three words of “pears”.
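To make the idea concrete, here is a small Python sketch of what a “within N words” test looks like when applied to a piece of text. It is purely illustrative and says nothing about how Google implements AROUND: the function name is made up, and counting the distance as the difference in word positions is my own assumption.

```python
import re


def within_n_words(text, first, second, n):
    """True if `first` and `second` occur within n word positions of each other.

    Assumption: distance is the difference in word positions, so adjacent
    words are 1 apart. Matching is case-insensitive and on whole words.
    """
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= n for i in first_positions for j in second_positions)


if __name__ == "__main__":
    print(within_n_words("apples and ripe pears", "apples", "pears", 3))
    print(within_n_words("apples go rather well with pears", "apples", "pears", 3))
```

The first sample has the two words three positions apart, so it passes the test; the second has them five apart, so it fails.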

[ Hat-tip: Researchbuzz ]

Google’s new ‘Advanced Reading Level’

10 Friday Dec 2010

Posted by futurilla in JURN's Google watch

Google has implemented a new filter that allows the filtering of search results by ‘reading level’. It’s accessed via the Advanced Search page, thus…

In a search for the term “reading level”, with the Reading Level set to Advanced, I still had a basic About.com page in the first page of results, as well as this blatant SEO spam page as result No.8.

A search for ‘tolkien + symbols’ showed better results, with a solid and useful first two pages of results. Although not that much different from the standard search, except that using Advanced Reading Level blocked a result from the scumbag SEO spam domain directhit.com on the second page of plain results.

How to import/export a list of banned URLs from the Google Noise Reduction script, for Firefox + GreaseMonkey.

20 Saturday Nov 2010

Posted by futurilla in How to improve academic search, JURN tips and tricks, JURN's Google watch

You may have spent some time building up a list of banned URLs for the Firefox addon Surfclarity, which strips unwanted domains from Google Search Results. Surfclarity no longer works with the latest Google changes, but the Greasemonkey script Google Noise Reduction does. In this tutorial we’ll swop the Surfclarity blacklist into the Google Noise Reduction blacklist.

1. In Firefox’s address bar, type: about:config.

2. Scroll down to extensions.surfclarity.patterns

Double click on the line of banned URLs you’ll find there, and copy them to Notepad.

3. Scroll further down to greasemonkey.scriptvals.http://exego.net//Google Noise Reduction.blacklist and take a look at the format. Note that it’s a little different than Surfclarity…

({‘britannia.com’:true, ‘oxfordjournals.org’:true, ‘tandf.co.uk’:true, ‘ingentaconnect.com’:true, ‘sagepub.com’:true, ‘myspace.com’:true, ‘experts-exchange.com’:true})

So we’re going to have to do some basic search-and-replace on our Surfclarity blacklist. Back up the Google Noise Reduction.blacklist if you want, as we’re going to overwrite it in a few moments.

4. Go back to Notepad and look at the list of Surfclarity URLs you just copied out.

Search for : and replace with : ‘ — note the space after the “:”.

Then search for : and replace it with ‘:true,

Now add ({‘ to the very start of this list, and ‘:true}) to the very end of this list.

Congratulations, you now have your SurfClarity list in Google Noise Reduction format.

5. Copy your new list to the clipboard, go back to greasemonkey.scriptvals.http://exego.net//Google Noise Reduction.blacklist, clear what’s in there at the moment, and then paste the new list in. You’re done.

Obviously, you can now also make a backup copy of the Google Noise Reduction.blacklist in the same way.
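The re-formatting in step 4 can also be sketched in Python, for anyone with a long list. This is a rough sketch under stated assumptions, not part of either addon: the function name is made up, and it starts from a plain Python list of domains, because how you split your copied Surfclarity value into separate domains depends on how that value is delimited. It writes straight quote marks, on the assumption that the stored value uses them.

```python
def domains_to_noise_reduction(domains):
    """Build a Noise Reduction style blacklist value from a list of domains.

    Blank entries and repeats are dropped; the original order is kept.
    """
    cleaned = []
    for domain in domains:
        domain = domain.strip()
        if domain and domain not in cleaned:
            cleaned.append(domain)
    entries = ", ".join("'{}':true".format(d) for d in cleaned)
    return "({" + entries + "})"


if __name__ == "__main__":
    # Hypothetical input: one domain per line, already split out by hand.
    my_domains = "britannia.com\nmyspace.com\nexperts-exchange.com".splitlines()
    print(domains_to_noise_reduction(my_domains))
```

The printed line is what you would paste into the blacklist value in step 5.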

How to get the old Google Images back

13 Saturday Nov 2010

Posted by futurilla in JURN's Google watch

Remove the new Google Image Search’s increasingly annoying ‘Bing-bling’, by using Firefox + GreaseMonkey + a potent combination of Google Image Basic and Direct Images in Google Image Search!. Image search then reverts to how it used to be. Clicking on a thumbnail in the search-results takes you straight to the largest version. When searching for images “larger than…” you may need to tell Firefox (one-time only) what application to open the image with, rather than have it pop up a “where would you like to download this to…” dialog. I told it to open large images with Firefox itself, and large images then open in a new Firefox tab. Nice.

And, while you’re at it… Flickr: link all sizes.

News from JURN

~ search tool for open access content
