Google’s new Dataset Search tool

07 Friday Sep 2018

Posted by futurilla in JURN's Google watch, Spotted in the news

Google has a new Dataset Search tool. It looks good.

An initial test search for Krita (the open source paint software) didn’t pick up anything, so it appears to be limited to datasets and is not also pulling in general file-names from FTP servers.

A wide search for Antarctica Cephalopods then gave a good set of 25 results, all of which were record pages that appeared to place their dataset under CC or to be public domain (NASA etc). There doesn’t appear to be any way to then load a further set of results, or to do a further keyword search within the record-pages of the results.

New type of Custom Search Engine

17 Tuesday Jul 2018

Posted by futurilla in JURN's Google watch, Spotted in the news

Google Custom Search has slightly expanded its range of services.

The Standard and Non-profit CSE services are unchanged.

They also offer a CSE via a JSON API: there’s no Google branding on that, but you pay $5 per thousand queries, and are limited to 10,000 search queries per day.

The new and fourth offering is a “Site Restricted JSON API”: it also requires the same “$5 per thousand search queries” payment. But if you search across no more than 10 URLs, then there’s no daily traffic limit.

I guess a use-case for this would be a huge and very heavily-used corporation like Boeing, where you want to offer your clients the quickest and most accurate way to search across all your technical reports, papers and manuals — which are spread across 10 different URLs? That use-case would likely need some guarantees from Google, though, on the spread and depth of the indexing.
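For a rough sense of what those prices mean in practice, here is a minimal arithmetic sketch. It uses only the figures quoted above ($5 per thousand queries, and a 10,000-per-day cap on the standard JSON API); the function names are my own, and the numbers are a 2018 snapshot rather than anything to budget from.

```python
# Figures as quoted in the post; treat them as a 2018 snapshot.
PRICE_PER_THOUSAND_USD = 5.0
STANDARD_DAILY_CAP = 10_000  # the Site Restricted API is described as having no daily cap


def query_cost_usd(queries):
    """Cost in US dollars of running the given number of search queries."""
    return queries / 1000 * PRICE_PER_THOUSAND_USD


def days_needed(queries, daily_cap=STANDARD_DAILY_CAP):
    """Minimum number of days needed to run the queries under a daily cap.

    Pass daily_cap=None for the uncapped (site-restricted) case.
    """
    if daily_cap is None:
        return 1
    return -(-queries // daily_cap)  # ceiling division


if __name__ == "__main__":
    monthly_queries = 250_000  # hypothetical traffic figure, for illustration only
    print(query_cost_usd(monthly_queries))
    print(days_needed(monthly_queries))
    print(days_needed(monthly_queries, daily_cap=None))
```

So a service running flat-out against the standard cap would be spending about $50 a day, and anything busier than that is where the uncapped site-restricted offering starts to matter.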

Getty kills Google Image’s ‘View image’ button: how to fix it

16 Friday Feb 2018

Posted by futurilla in JURN tips and tricks, JURN's Google watch, Spotted in the news

Under pressure from commercial image library Getty, Google Images has removed a key button from its search results. It’s the “View Image” button, which allowed people to view an image in isolation, against whatever colour they have set as a background for the Web browser.

The removal is easily fixed with a simple new script:

Firefox: Google Images Fix for Greasemonkey.

Chrome and Chrome-compatible: Google Search "View Image" Button

If you also want to change the default background colour (white can be better for screen-shots of logos for Facebook posts, to get an edge), in Firefox you can change the Web browser’s default background from black thus: Tools | Options | Content | Colours | Background | OK.

There are also press reports that the “search by image” icon in the Google Images search box is to be removed, also due to Getty pressure. But I see it’s still there on the UK version of Google Images.

On doing nation-specific Web search

31 Wednesday Jan 2018

Posted by futurilla in JURN tips and tricks, JURN's Google watch

In Autumn 2017 Google announced that Google Search would ignore the country domain of its service, and instead serve you national results based on what Google thinks your geographic location is…

“the choice of country service will no longer be indicated by domain. Instead, by default, you’ll be served the country service that corresponds to your location.”

Here’s my quickstart on some of the nation-specific research options which can route around this. You either need to:

i) use the likes of DuckDuckGo and add national URL Parameters to the end of your bookmarked URL: e.g. Hungary. Top results are not great in that instance, with BBC, Wikipedia and Guardian cruft, but they quickly become relevant as you scroll down. Adding site:hu helps a lot, at the cost of knocking out local grassroots blogs on WordPress and Hungarian .org and .com sites etc.

DuckDuckGo is now actually better than Google, in my opinion, for picture research. Though you will have to home-brew a Creative Commons filter within your search terms.

ii) Go to Google’s Advanced Search settings and (for now) you can request that Google Search “narrow your results” by nation. Clunky, but it may prove useful. I imagine there must be a browser plugin that allows this setting to be swiftly switched across various nations.

iii) use a VPN proxy in your Web browser. The Opera web browser has a free and sturdy VPN built in, but all you can do with it these days is to select broad regions rather than nations (as used to be the case). Adequate for things like quickly getting past region-blocking on public domain resources at Hathi, etc, but not that useful if you just want to research ceramics in Morocco.

iv) use a free VPN such as Browsec. This offers three or four free national VPN nodes, each with a limited access duration (10 minutes or so before it becomes unresponsive). Again, useful for researchers wanting to access region-locked Hathi books or YouTube videos etc. Such freebie VPNs also offer an enticingly big list of other national nodes for paid users…

v) The TOR browser. Google’s new move potentially leaves sensitive ‘business researcher traffic’ open to being snooped on and tracked by hostile/piratic nations, who may clandestinely run VPN services and/or be able to tap into VPN traffic. As such, smaller businesses (especially those in a larger supply-chain but without security-savvy IT departments) might also look into the anonymous TOR browser’s capabilities before doing intensive country research. It’s my understanding that some TOR exit nodes can be geolocated to nations, while others appear to be free of geolocation, and apparently one can switch between these types and choose which nation the exit node is in.
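Option (i) above, the bookmarked DuckDuckGo URL with a national parameter on the end, can be sketched in a few lines. This assumes that DuckDuckGo's region setting is passed as the `kl` URL parameter and that `hu-hu` is the code for Hungary; both are my assumptions rather than something stated in this post, so check them against DuckDuckGo's own settings page. The function name is hypothetical.

```python
from urllib.parse import urlencode


def ddg_region_url(terms, region=None, site_suffix=None):
    """Build a bookmarkable DuckDuckGo search URL.

    region      -- assumed DuckDuckGo region code, e.g. "hu-hu" (assumption)
    site_suffix -- optional country domain to restrict to, e.g. "hu"
    """
    query = f"{terms} site:{site_suffix}" if site_suffix else terms
    params = {"q": query}
    if region:
        params["kl"] = region  # "kl" is assumed to be the region parameter
    return "https://duckduckgo.com/?" + urlencode(params)


if __name__ == "__main__":
    # Region-biased search only
    print(ddg_region_url("ceramics", region="hu-hu"))
    # Region-biased and restricted to .hu domains
    print(ddg_region_url("ceramics", region="hu-hu", site_suffix="hu"))
```

The second form trades breadth for relevance, as noted above: adding site:hu drops the Hungarian blogs and .org/.com sites that the region setting alone might still surface.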

So far as I’m aware, JURN has for some time now auto-detected your home nation and served results accordingly. Some types of user can route around this somewhat, by searching in a local alphabet and encasing words or phrases in quote marks (“مقارنة”) which in this case should mean the majority of search results are in Arabic.

Google Shorter?

07 Thursday Dec 2017

Posted by futurilla in JURN's Google watch

I just ran a search on Google Scholar, and Scholar decided to present me with only two results (from Elsevier and Springer). The other 231 results (perfectly valid, often also from Elsevier and Springer) were hidden behind a small link to “See all results”. A curious new behaviour…

It seems we may need a browser add-on that forces “show all results” as the default page of results.

One way to fix your broken Google News RSS feeds, at November 2017

04 Saturday Nov 2017

Posted by futurilla in JURN tips and tricks, JURN's Google watch

The new RSS change at Google News makes their existing keyword-based RSS feeds defunct. It affects the RSS feeds that collect all Google News items with a headline/snippet containing the words ‘bunny’ + ‘fluffy’, for instance. I don’t know if the generic catch-all ‘Science’, ‘Health’ etc RSS feeds are affected, as I don’t use those.

Those keyword-based feeds will now need to be changed. Changed slowly and manually and individually by slogging down the list in one’s RSS feedreader. It’s a big task for some, as journalists and editors and bloggers may have hundreds (if not thousands) of these feeds set up.

So far as I can see there’s no way to export the OPML from one’s desktop RSS feedreader and then simply do a global search-replace of the Google News URL paths in Notepad++, then bring the OPML back in. The URLs are too complex and varied in their structures to allow that.

One way of tackling the change is as follows:

Aim: Open our list of feeds in Excel and extract only the Google News ones, thus making it relatively easy for a worker to run through them all and discover the new ones.
Software required: the free Notepad++ and MS Office Excel with Sobolsoft’s Excel Remove Text addin.

1. Export your OPML master file from your RSS feedreader / newsreader.

2. Right-click on this and open the OPML in Notepad++. Search/replace "/> with "/>; and then manually go through and add a ; to the end of the remaining few lines which now lack them.

3. Search/replace all , (i.e.: all the commas) and change these to &&&&.

4. Save a backup of the changed OPML, then save another copy from Notepad++ — this time as “feeds.csv” which makes it a comma-separated Excel file. “But there are no commas left” you cry. That doesn’t matter, as Excel will treat the ; instances as if they were commas. And it won’t be terminally confused by commas sitting within the URLs, as we just changed them all to &&&&.

5. You can now load feeds.csv in MS Office’s Excel spreadsheet package. If you successfully put a ; at the end of each line of the OPML, Excel will happily load the file and it will display correctly, meaning in a similar way to the clear structured view you saw in Notepad++.

6. You’re now able to extract all the lines containing the phrase “Google News” and then do the same for “news.google”. There are a number of complex ways to do this, involving fiendish formulas, but a very easy way is with Sobolsoft’s Excel Remove Text, Spaces & Characters From Cells add-in. This gives Excel a number of very useful functions, including “Clear all cells not containing X”. Select all lines. Then clear everything not containing Google News. You can then ‘sort A-Z’, to get a neat list of all your defunct Google News feeds, one per line.

7. Select all lines with content in them. Then use the same add-in to “Remove all text before…” xmlUrl=" (which is the query command in the URL). Then “Remove all text after…” &output=

You can continue doing this sort of search/replace, and thus end up with a fairly clean set of the keywords and phrases and knockout -keywords which you were using for each Google News URL. For instance, you can search/replace %22 with " to get recognisable search phrases again, inside the URL.

If you have hundreds or thousands of these, they can now be passed to a gig worker at Fiverr.com etc, tasked with working down your nicely cleaned one-per-line list to discover the new working RSS URLs from Google News. While they’re at it, you may as well pay them to discover the Bing News equivalents.

You may also want them to use a VPN in order to also snag the Google News USA equivalent URLs, if you’re in the UK etc. Although it appears possible that simply changing the end of the new URLs from ?hl=en-GB&gl=GB&ned=uk to ?hl=en&gl=US&ned=us does the trick and gets the USA version. Google News USA obviously has better coverage, and is perhaps updated more quickly. For instance, a UK-centric search for: newcastle-under-lyme -police in Google News UK has no search results. The same from the USA site has one valid result in a local freesheet two hours ago. Such timeliness may matter for journalists with deadlines to meet.

8. You don’t then need to create a new OPML without any Google News URLs, and try to import it back to your newsreader etc. That’s a hassle and the OPML will probably break. So it’s easier to just let the defunct Google News URLs sit there and do nothing, since they’re not doing any harm. Some newsreader software may eventually flag them as defunct, and may even offer the ability to mass-delete your defunct feeds after 1st December 2017. Apparently that’s the date Google has set for the current feeds to die altogether.

9. Once your Fiverr gig worker etc comes back with the new URLs, either add in your new working Google News URLs by hand, or (if you have lots of them set up) have your Fiverr gig worker format them up as a valid OPML file for bulk import to your newsreader. That’s very simple to do, once you have a newly-working Google News sample line to show them, although I think there are website converters that will turn a one-per-line RSS URL list into a valid OPML with ease.

That’s the most efficient way I can think of for handling the changeover.
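For those comfortable with a little scripting, the extraction part (steps 2 to 7) could instead be done with a short Python script. To be clear, this is a different technique from the Notepad++ and Excel route described above: a minimal sketch that parses the OPML as XML and pulls out the search terms. It assumes the old feed URLs carry their search terms in a `q=` parameter and that each feed is an `outline` element with `text` and `xmlUrl` attributes; the function name is my own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs


def google_news_queries(opml_text):
    """Return a sorted list of (feed title, search terms) for Google News feeds.

    A feed counts as Google News if its URL contains "news.google" or its
    title contains "Google News", mirroring the two filters in step 6.
    """
    root = ET.fromstring(opml_text)
    found = []
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl", "")
        title = outline.get("text", "")
        if "news.google" not in url and "Google News" not in title:
            continue
        # parse_qs decodes %22 to a quote mark and + to a space;
        # "q" is assumed to be the parameter holding the search terms
        terms = parse_qs(urlparse(url).query).get("q", [""])[0]
        found.append((title, terms))
    return sorted(found)


if __name__ == "__main__":
    # A made-up two-feed OPML, for illustration only
    sample = """<opml version="1.0"><body>
    <outline text="Bunnies" type="rss"
      xmlUrl="https://news.google.com/news?q=%22fluffy+bunny%22+-police&amp;output=rss"/>
    <outline text="Some blog" type="rss" xmlUrl="https://example.org/feed"/>
    </body></opml>"""
    for title, terms in google_news_queries(sample):
        print(f"{title}\t{terms}")
```

The tab-separated output is the same one-per-line list that step 7 ends with, ready to hand on. For a real export you would read the file from disk rather than paste it into the script, and any feeds whose URLs don't follow the `q=` pattern would come out with empty search terms and need checking by hand.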

How to get your new RSS feed from Google News

03 Friday Nov 2017

Posted by futurilla in JURN's Google watch

Annoyingly, Google appears to have just removed all its keyword-based RSS feeds for Google News. One gets the message…

This RSS feed URL is deprecated, please update. New URLs can be found in the footers at https://news.google.com/news.

But all it’s possible to get there is the generic national Spotlight headlines, as linked in the footer of the main Google News page…

https://news.google.com/news/rss/headlines?gl=GB&ned=uk&hl=en-GB

And even that feed “has no articles” when loaded into a feedreader.

What you actually need to do is to first run a new Google News search, then the new RSS feed link will appear in the footer of the page of search results.

If, at the same time as you’re fiddling with this annoying change-over, you want to swop out your Google News RSS for a working Bing News RSS feed, here’s how:

1. Do a keyword or phrase-based News search as usual, at Bing News.
2. Add -keyword to knock out unwanted stories (e.g. -police -NHS)
3. Then re-sort the search results by date.
4. Add &format=rss to the end of the URL. This turns it into an RSS feed from Bing News.
5. Now plug your new RSS feed into your newsreader.
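If you have a lot of feeds to convert, the Bing News steps above can be roughed out in code. This is only a sketch: the `https://www.bing.com/news/search` base address and the function names are my assumptions, and it doesn't attempt the sort-by-date step, so for date-sorted feeds it's safer to copy the URL from the browser after re-sorting and pass that to `as_rss`.

```python
from urllib.parse import urlencode

# Assumed address of the Bing News search page
BING_NEWS_SEARCH = "https://www.bing.com/news/search"


def bing_news_query(terms, knockouts=()):
    """Combine search terms with -keyword knockouts (steps 1 and 2)."""
    return " ".join([terms, *("-" + word for word in knockouts)])


def as_rss(search_url):
    """Append format=rss to a Bing News search URL (step 4)."""
    if "format=rss" in search_url:
        return search_url  # already an RSS URL, leave it alone
    separator = "&" if "?" in search_url else "?"
    return search_url + separator + "format=rss"


if __name__ == "__main__":
    query = bing_news_query("newcastle-under-lyme", ["police", "NHS"])
    search_url = BING_NEWS_SEARCH + "?" + urlencode({"q": query})
    print(as_rss(search_url))
```

The resulting URL is what goes into your newsreader at step 5.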

Face it

31 Tuesday Oct 2017

Posted by futurilla in JURN's Google watch

Google Images needs an additional filter. Something like: “Face with lots of complex background, people doing stuff”, as well as “Face”. Otherwise, no matter what your search terms are, with “Face” you just get head-and-shoulders mug-shots and boring zoomed-in snaps of conference presenters (why do people even make the latter?).

Pop off, Google…

07 Saturday Oct 2017

Posted by futurilla in JURN's Google watch

More junk in the Google Search box? It seems so, in the form of another layer of distractingly dumb autosuggest. Which is now on individual words, even those at the end of a long-chain search query, as a ‘pop-down’.

No, Google — when I am searching for “public domain”, I have no interest in “domain names”. An apparently hyper-intelligent search company jammed with semantics experts and AI should know that by now.

Thankfully it can be hidden with AdBlock Plus’s Element Hiding Helper.

Another Google CSE dashboard glitch?

26 Friday May 2017

Posted by futurilla in JURN tips and tricks, JURN's Google watch

The recent changes to the Google CSE services appear to have introduced another glitch. The problem happens when adding new URL entries into your Google CSE. For instance, you can no longer add…

http://www.nnns.org.uk/sites/nnns.org.uk/files/

… and reliably select “Include all pages whose address contains this URL”. Oh yes, the Dashboard will let you save it that way… but then go back and open the URL up again. You’ll see that the CSE dashboard has refused to accept the setting you gave the URL, and has instead defaulted the URL to: “Include just this specific page or URL pattern I have entered”.

The problem with this is that you didn’t explicitly enter http://www.nnns.org.uk/sites/nnns.org.uk/files/* with the * wildcard on the end, and it is the wildcard that makes the “Include just this specific page or URL pattern I have entered” setting functional. Without the wildcard, the http://www.nnns.org.uk/sites/nnns.org.uk/files/ URL is null and void on that setting, and may as well have not been added to your CSE.

This has only just started happening, and the “Include all pages whose address contains this URL” setting is sticky on entries made prior to about 24 hours ago. Which makes me think it’s probably a temporary glitch, inadvertently introduced during yesterday’s switch from three-options to two-options for settings on individual URLs.

If you’re working on a CSE over the weekend / Bank Holiday (UK), you should be aware of this problem, as it probably won’t be fixed by Google until early next week. You’ll probably want to keep a .txt file of all the URLs you add which you have to use a /* for, because you may need to manually change them back once the problem gets fixed.
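If you're adding URLs in bulk, the workaround and the .txt log can be prepared together. A minimal sketch, with hypothetical function names of my own: it simply puts the * wildcard on the end of each URL and produces the one-per-line text to save as your log.

```python
def as_wildcard_pattern(url):
    """Return the URL with a trailing * wildcard, adding one only if missing."""
    url = url.strip()
    return url if url.endswith("*") else url + "*"


def wildcard_log(urls):
    """One wildcard pattern per line, skipping blank entries, ready to save as .txt."""
    patterns = [as_wildcard_pattern(u) for u in urls if u.strip()]
    return "\n".join(patterns) + "\n"


if __name__ == "__main__":
    new_entries = ["http://www.nnns.org.uk/sites/nnns.org.uk/files/"]
    log_text = wildcard_log(new_entries)
    print(log_text, end="")
    # To keep the log, write log_text to a file of your choosing, e.g.
    # pathlib.Path("cse-wildcards.txt").write_text(log_text, encoding="utf-8")
```

The patterns it prints are what you paste into the dashboard, and the saved file is your list of entries to revisit once the setting sticks again.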

News from JURN

~ search tool for open access content

Category Archives: JURN's Google watch
