Added to JURN

The findings provide first-time size estimates of ProQuest and EBSCOHost and indicate that Google Scholar’s size might have been underestimated so far by more than 50%. By our estimation Google Scholar, with 389 million records, is currently the most comprehensive academic search engine.

With the later proviso that there are likely to be many duplicates and near-duplicates, with such tools reporting…

the number of all indexed records on a database, not the number of unique records indexed. This means duplicates, incorrect links, or incorrectly indexed records are all included in the size metrics provided by ASEBDs.

As you can see, the article coins the ugly and unreadable “ASEBDs” for “academic search engines and bibliographic databases”. MASTs might be more mellifluous — Massive Academic Search Tools.

Added to JURN

06 Tuesday Nov 2018

Posted by futurilla in New titles added to JURN

MidAmerica

Midwestern Miscellany

Block autosuggestions from the Google Scholar search box

05 Monday Nov 2018

Posted by futurilla in How to improve academic search, JURN tips and tricks

For those who know what they’re looking for, and how to type… here’s how to block the dumb auto-suggestions from appearing on the Google Scholar search-box:

1. In the uBlock Origin Web browser addon, open the My Filters list (Go: Icon | Slider Controls Icon | My filters tab).

2. Paste in the line…

google.*##[class^="gs_md_"]

3. Save the List and exit it. Reload Google Scholar, and the flickery and distracting (and almost always very wrong) drop-down suggestions are gone.
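For the curious, the part of the filter line after the ## is an ordinary CSS attribute selector, and ^= means “the attribute value starts with”. Here is a minimal Python sketch of that matching rule. The HTML fragment is made up for illustration and is not Google Scholar’s real markup; only the gs_md_ prefix comes from the filter above.

```python
from html.parser import HTMLParser


class PrefixClassFinder(HTMLParser):
    """Collects tags whose class attribute starts with a given prefix,
    mimicking the CSS selector [class^="prefix"]."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        # ^= tests the whole attribute string, not each class name in it.
        if class_value.startswith(self.prefix):
            self.matches.append((tag, class_value))


def find_blocked(html, prefix="gs_md_"):
    finder = PrefixClassFinder(prefix)
    finder.feed(html)
    return finder.matches


# Hypothetical markup, for illustration only.
SAMPLE = (
    '<div class="gs_md_ac">suggestions</div>'
    '<div class="results gs_md_x">kept</div>'
    '<input class="search_box">'
)

print(find_blocked(SAMPLE))
```

Note that the second div is not matched, because its class attribute begins with a different class name. If a blocked element ever stops being hidden, that is one thing to check.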

A variant of the above ‘block line’ will probably also work in similarly advanced ad-blocker addons.

News you can lose…

03 Saturday Nov 2018

Posted by futurilla in Spotted in the news

It looks like stories from U.S. news outlets, those that blank UK and European visitors, are now simply being removed from the Google News results. Spotted today, under the Google News results…

Annoying for those inclined to turn on their VPN and see the news story regardless. But, in practice, the publications still blocking overseas visitors are such low-grade regional newspapers that it’s no loss.

Clips and flicks

03 Saturday Nov 2018

Posted by futurilla in Spotted in the news

All U.S. film-makers can now crack anti-copying technologies on content ($ paywalled at law.com), if they need that content for ‘fair use’ use in a new production…

“Digital Millennium Copyright Act (DMCA) exemptions aren’t just for documentary filmmakers any more. The U.S. Copyright Office and Library of Congress last week broadened a DMCA exception to now allow more filmmakers to circumvent anti-copying technology and rip short video clips for purposes of commentary and criticism.”

However, it isn’t a free-for-all. Note that the PDF for the rules states that this new measure is specifically for…

“where the clip is used for parody or its biographical or historically significant nature”.

In a drama movie, the “commentary and criticism” would thus presumably be seen to be implied by the nature of the scene, rather than done in a directly academic or journalistic manner. For instance, I can imagine a dramatised scene of dancing on the beach as the Apollo 11 rocket lifts off behind the dancers. This scene would be a sort of implied commentary on the optimism engendered in the nation by the historically significant moment of sending men to the Moon. And if the high-res source needed for that was only available from Time-Life rather than NASA, then their Blu-ray disc could be cracked and a clip used as the background in the composite. Actually these days it’s probably easier to do it with 3D models and a copy of Vue, but some may want the original footage, and historical personages can’t simply be conjured up in the same way.

Also, as the word “clip” is used and video is assumed in the PDF’s text, that leaves hazy the cracking of content protection to obtain a high-res still picture. A film-maker might need such a still for a Ken Burns “pan and scan” type film, and could perhaps argue that the still was required as an irreplaceable source needed to make the film’s video “clip”. But that’s probably something to be clarified in a future round of rule changes.

Retraction Watch Database

27 Saturday Oct 2018

Posted by futurilla in Spotted in the news

Retraction Watch now has a unified database of retracted papers.

Practical blog search at the end of 2018

26 Friday Oct 2018

Posted by futurilla in JURN tips and tricks

It’s the back end of 2018 and there’s still no really useful and comprehensive search tool for recent blog posts, other than the main Google Search. And even that is iffy. Given that we’re approaching Halloween, I decided to do a quick group test with the simple keyword Lovecraft. He’s a good choice because so much utter trash floods onto the Web in his name. If a search can deal with Lovecraft, it should be able to handle much else.

* Google News: Can filter by ‘blogs’ and by ‘date’, but the results are laughable — are there really only eight blog posts on Lovecraft in October 2018, from worthy long-form and timely-news bloggers? I think not. (Another test for ‘Staffordshire’ suggests News | Blogs is almost all just press-release outlets and similarly worthless pseudo-blogs).

* Google Search: The inblogtitle:keyword modifier is no longer useful in search, as it now returns only 10 irrelevant results when used with Lovecraft. One used to be able to find sites that Google ‘knew’ were blogs, and had a keyword in their main blog title. Google Search has also removed ?tbm=blg from their URL options.

* WordPress.com internal cross-blog search: Simple to use, the results look pretty, but it obviously has very mediocre coverage of its own blogs. Many expected and well-respected blogs do not appear at all. Users need to be aware that they are not seeing results from the entire range of non-spam WordPress.com hosted blogs.

I would suspect that DuckDuckGo may be using this WordPress.com results set as a de facto anti-spam whitelist, since that would explain its curious big gaps in the coverage for WordPress.com blogs. The same may be true of the dismal Bing — the only saving grace for which is the excellence of the Bing News | Most Recent results, which you can RSS-ise by adding &format=rss to the URL. By comparison, NewsNow is nowhere.

* You Got Blogs, a Google CSE: Fairly good at pulling the top three currently-active blogs to the top of the results, but thereafter turns to mush. If the user then sorts by Date on a single keyword, the results are far less useful, mainly because You Got Blogs is indexing all *.wordpress.com/* pages rather than just the blog posts via *.wordpress.com/20*/* patterns. You Got Blogs is reliant on Google Search, since it’s a CSE, and thus for many blogs Google will only show the most recently-indexed post or else just the front page (e.g. you make seven posts a week, but Google will only show searchers the post it has most recently indexed, and the others will be un-findable). It’s thus an impossible balancing act for You Got Blogs (or any other blog-focussed CSE): if they don’t do a global index of *.wordpress.com/* then they miss a whole lot of results, and if they do then the sort-by-date results suffer.

* Regrettably, setting up your own Google CSE (for *.wordpress.com and *.blogspot.com etc) is not an option. I’ve tried it and in practice it doesn’t work well when one sorts by Date. It’s sort-of-ok on a straight search, if making a first search looking for blogs on one’s topic, though the main Google Search would do better. A CSE picks up and lifts to the top of the results some very out-of-date and moribund blogs, and obviously can’t deliver usable sort-by-date results.

* Social Mention. Search restricted just to ‘Blogs’. Pathetic results from ‘Blogs’. No results at all, for ‘Microblogs’. Top three results were very similar to the WordPress.com internal, then a huge gap in time. My guess is they’re blending together the WordPress.com and Bing APIs, and to no great effect.

* DuckDuckGo: Should, theoretically, be good. But is mediocre. It all-but ignores key Lovecraft blogs, blogs which rank very highly in Google Search. I should note that the Duck is excellent in many other respects, especially the relevance of its Image Search. But it still lacks breadth and depth.

* Instant RSS Search Engine. No longer appears to work, even when tested in multiple browsers.
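The distinction drawn in the You Got Blogs item above, between any page on a hosted blog and an actual dated post, can be sketched in a few lines of Python. This is only an illustration: the URL shapes are my assumption (WordPress.com posts as /YYYY/MM/DD/slug/ and Blogger posts as /YYYY/MM/slug.html), the example.* addresses are hypothetical, and the pattern is a little stricter than the simple /20*/ wildcard.

```python
import re
from urllib.parse import urlparse

# Assumed URL shapes: WordPress.com posts as /YYYY/MM/DD/slug/ and
# Blogger posts as /YYYY/MM/slug.html.
DATED_PATH = re.compile(r"^/(20\d{2})/(0[1-9]|1[0-2])/.+")
BLOG_HOSTS = (".wordpress.com", ".blogspot.com")


def is_dated_post(url):
    """True if the URL is on a hosted blog and its path starts with
    a year and month followed by something more (i.e. a post)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    if not host.endswith(BLOG_HOSTS):
        return False
    return DATED_PATH.match(parts.path) is not None


# Hypothetical addresses, for illustration only.
for url in (
    "https://example.wordpress.com/2018/10/26/some-post/",
    "https://example.blogspot.com/2018/10/some-post.html",
    "https://example.wordpress.com/about/",
    "https://example.wordpress.com/2018/10/",
):
    print(is_dated_post(url), url)
```

The last address is a monthly archive page rather than a post, so it is deliberately rejected.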

For niche news gatherers wishing to supplement their RSS feedreader and break out of the tiny-minded Twitterbubble, the best option at the end of 2018 is thus to set up a bookmarks folder in your Web browser with the following:

site:wordpress.com/2018/10/ "Lovecraft" -zombie -game -movie
site:blogspot.com/2018/10/ "Lovecraft" -zombie -game -movie

Vary according to your desired keyword and knockout words, obviously. These URLs will work because all blog posts on Blogger and WordPress have the date embedded in their URL.

These bookmarks should be set to run on Google Search and DuckDuckGo and Yandex (the latter with a &lang=en English only filter in the URL). Right-click on the finished Bookmarks folder, select “Open All” and they all load.

Of course, this doesn’t pick up self-hosted blogs, only the free ones. And, obviously you’ll have to manually go in and incrementally change the date numbering in the target URLs, at the end of each month. Thus it’s not a perfect solution. (Nor can this solution be amalgamated into a Google CSE, for the reasons stated above).
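If the monthly hand-editing becomes a chore, the bookmark URLs could instead be generated. What follows is a rough Python sketch, not a finished tool. The three search URL templates are my assumptions about each engine’s address format (only the &lang=en part for Yandex comes from the notes above), so check them against your own working bookmarks.

```python
from datetime import date
from urllib.parse import quote_plus

# Assumed search URL templates; check them against your own bookmarks.
ENGINES = {
    "google": "https://www.google.com/search?q={q}",
    "duckduckgo": "https://duckduckgo.com/?q={q}",
    "yandex": "https://yandex.com/search/?text={q}&lang=en",
}
HOSTS = ("wordpress.com", "blogspot.com")


def monthly_query(host, keyword, knockouts, when):
    """Build one query, e.g. site:wordpress.com/2018/10/ "Lovecraft" -zombie"""
    minus = " ".join(f"-{word}" for word in knockouts)
    return f'site:{host}/{when.year}/{when.month:02d}/ "{keyword}" {minus}'.strip()


def monthly_urls(keyword, knockouts, when=None):
    """Return one search URL per host and engine, for the given month."""
    when = when or date.today()
    urls = []
    for host in HOSTS:
        query = quote_plus(monthly_query(host, keyword, knockouts, when))
        for template in ENGINES.values():
            urls.append(template.format(q=query))
    return urls


for url in monthly_urls("Lovecraft", ["zombie", "game", "movie"], date(2018, 10, 1)):
    print(url)
```

Leave out the date argument and it uses the current month, which removes the end-of-month editing. It still only covers the free hosted blogs, of course.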

Once the searches have loaded, switching through to a “week” or “24 hour” view will require the copious use of Google Hit Hider by Domain, to weed the spam and unwanted results. Google Hit Hider knocks out unwanted domains from search results, and does it very well. (Google Hit Hider can run on Yandex, it just needs the results reloaded, in order for its blocking buttons to appear).

Even having set up such a one-click Bookmarks folder, we also still have the problem of Google Search sometimes only offering the front page of a timely and frequently updated blog, rather than its most recent post URLs. In practice though, for a ‘last 24 hours’ search, you don’t actually need a site: modifier…

site:wordpress.com "Lovecraft" -zombie -game -movie

All you need is the ‘last 24 hours’ filter alone, and Google Search will lift some of the best content into the first two pages of results. Kind of useful, as it can thus catch self-hosted blogs, albeit jumbled among legacy news sources and updating catalog sites etc. Even so, you’ll want Google Hit Hider when working at the 24 hour level.

Also useful, inside your new folder, will be a similarly hard-coded Google Images search URL for the last 24 hours or week…

"keyword" -pinterest -youtube -twitter -wikipedia -tumblr -instagram

… and so on. It only takes a few seconds to visually check the results, and such timely visual results are often useful re: new books, conference posters etc. Keep eBay listings in the mix as they can suggest interesting blog post topics about vintage stuff. Again, we’re not keying the search to blogs only, and thus Google Hit Hider is your friend here (it also works on Google Images results: block on Google Search, and it’s also blocked on Images).
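For those who would rather build such hard-coded URLs than copy them from the address bar, here is a hedged Python sketch. The parameters tbs=qdr:d (past 24 hours), tbs=qdr:w (past week) and tbm=isch (Images) are long-used but unofficial Google URL options, so treat them as assumptions that may change or vanish without notice.

```python
from urllib.parse import urlencode

# Assumed, unofficial Google URL parameters: tbs=qdr:d (past 24 hours),
# tbs=qdr:w (past week), tbm=isch (Images). They may change without notice.
TIME_CODES = {"day": "qdr:d", "week": "qdr:w"}


def google_recent_url(query, period="day", images=False):
    """Build a Google Search (or Images) URL limited to a recent period."""
    params = {"q": query, "tbs": TIME_CODES[period]}
    if images:
        params["tbm"] = "isch"
    return "https://www.google.com/search?" + urlencode(params)


image_query = '"keyword" -pinterest -youtube -twitter -wikipedia -tumblr -instagram'
print(google_recent_url(image_query, period="week", images=True))
print(google_recent_url('"Lovecraft" -zombie -game -movie', period="day"))
```

Paste the printed URLs into the bookmarks folder, swapping in your own keyword and knockout words.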

There are of course also a whole bunch of “request a demo” agency services which claim to offer social media sentiment tracking. They seem to be of the ‘if you have to ask the price, you can’t afford it’ sort. There’s one free and public service worth a look, Social Searcher. Very slow to load a search, but it’s pretty and it works. It’s no use for blogs, though, but seems useful if you want to quickly glance across recent Facebook and Twitter posts. It covers some other ephemeral sharing sites, but their signal gets swamped by Facebook and Twitter. Not that that matters much as it’s almost all blather and parroting, of no news value. To prevent results turning into a wall of hashtags, the tags panels can be blocked in uBlock Origin with social-searcher.*##[class^="rezults-item-tags"]

Text Cleanup 2.0 – now free

24 Wednesday Oct 2018

Posted by futurilla in JURN tips and tricks, Spotted in the news

I’m pleased to see that Text Cleanup 2.0 is now freeware. It’s Windows desktop software from 2003 that “fixes” text automatically when you copy-paste it. For instance, by unwrapping a chunk of text that has hard line-breaks. Text Cleanup has a nice balance of power and ease-of-use, can save user presets, and still runs fine on a Windows 8.x desktop.
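To show what “unwrapping” means here, a minimal Python sketch of the general idea follows. It is not Text Cleanup’s own method, which I haven’t seen, just one simple way of joining hard-wrapped lines while keeping blank lines as paragraph breaks.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines into paragraphs.
    Blank lines are kept as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # split() with no arguments collapses line breaks and runs of spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


wrapped = (
    "It's Windows desktop software\n"
    "that fixes text\n"
    "automatically.\n"
    "\n"
    "A second\n"
    "paragraph."
)
print(unwrap(wrapped))
```

It makes no attempt to rejoin words hyphenated across a line break, which a real cleanup tool would probably want to handle.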

The Art Institute of Chicago’s CC0 pictures

24 Wednesday Oct 2018

Posted by futurilla in Spotted in the news

The Art Institute of Chicago now has 44,000 items from its collection downloadable as pictures under a CC0 licence. I did a test search for cat. What struck me first was the rich range.

My excitement was dampened when I realised that most of these results had no hi-res download. What I should have done was spot the easy-to-miss faded “filters” button, up top, which when clicked pops out a sidebar. In the sidebar you can tick to filter by “Public domain”, which gives you the results with the downloadable images.

The filtered results are still fairly impressive, but of course lack the nicer “wow” illustrations made after about the 1910s. Some images download without file extensions, possibly because they already have a . in their title (e.g. “Honorable Mr. Cat”)…
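If a batch of downloads arrives without extensions, they can be repaired by looking at the first few bytes of each file. Here is a small Python sketch, on the assumption (mine, not confirmed by the site) that the downloads are ordinary JPEG, PNG or GIF files. The function names are made up for this example.

```python
from pathlib import Path

# File signatures ("magic bytes") for common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": ".jpg",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"GIF87a": ".gif",
    b"GIF89a": ".gif",
}
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def guess_extension(header):
    """Return an extension for the given leading bytes, or None."""
    for magic, ext in SIGNATURES.items():
        if header.startswith(magic):
            return ext
    return None


def add_missing_extension(path):
    """Rename a file lacking a known image extension, based on its content."""
    path = Path(path)
    if path.suffix.lower() in KNOWN_EXTENSIONS:
        return path
    with path.open("rb") as handle:
        ext = guess_extension(handle.read(8))
    if ext is None:
        return path
    # Append rather than replace, so "Honorable Mr. Cat" keeps its full title.
    return path.rename(path.with_name(path.name + ext))
```

Call add_missing_extension() on each file in the download folder. Files that already have an image extension, or that are not recognised, are left alone.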

Some of the search substitutions are rather dumb, for instance if you search for plague you get plaque.

The pictures seem to mostly be around 2,000 to 3,000px and 96dpi. There’s no sign-up needed, and access is free and public.

News from JURN

~ search tool for open access content
