128TB SD cards

17 Tuesday Jul 2018

Posted by futurilla in Spotted in the news

128TB can now fit on a retail SD card. That’s 134 million megabytes. At five 200KB PDF papers per megabyte, such a postage-stamp sized SD card can now hold 600 million academic papers. It’s actually around 670 million, but I leave plenty of space for the indexing software and its index.
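For anyone who wants to check the sums, here is a minimal Python sketch of the estimate. It assumes binary units (1TB = 1024 × 1024 MB) and a flat 200KB per PDF. The function names and the 10% set aside for the index are my own illustrative assumptions, not measured figures.

```python
# Back-of-envelope capacity estimate for a 128TB card.
# Assumptions: binary units and five 200KB PDFs per megabyte.

CARD_TB = 128
MB_PER_TB = 1024 * 1024
PAPERS_PER_MB = 5


def card_megabytes(terabytes: int) -> int:
    """Return the card capacity in megabytes."""
    return terabytes * MB_PER_TB


def paper_capacity(terabytes: int, reserve_fraction: float = 0.0) -> int:
    """Return how many papers fit after reserving a fraction for the index.

    reserve_fraction is a hypothetical allowance for indexing software.
    """
    usable_mb = card_megabytes(terabytes) * (1.0 - reserve_fraction)
    return int(usable_mb * PAPERS_PER_MB)


if __name__ == "__main__":
    print(card_megabytes(CARD_TB))        # 134217728 megabytes
    print(paper_capacity(CARD_TB))        # 671088640 papers, nothing reserved
    print(paper_capacity(CARD_TB, 0.10))  # just over 600 million papers
```

Under these assumptions the raw total is about 671 million papers, and about 600 million once a tenth of the card is held back.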

New type of Custom Search Engine

17 Tuesday Jul 2018

Posted by futurilla in JURN's Google watch, Spotted in the news

Google Custom Search has slightly expanded the range of services.

The Standard and Non-profit CSE services are unchanged.

They also offer a CSE via a JSON API: there’s no Google branding on that, but you pay $5 per thousand queries, and are limited to 10,000 search queries per day.

The new and fourth offering is a “Site Restricted JSON API”: it also requires the same “$5 per thousand search queries” payment. But if you search across no more than 10 URLs, then there’s no daily traffic limit.
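As a rough illustration of the two paid options, here is a small Python sketch that builds a query URL and estimates a daily bill. The endpoint paths and the `key`, `cx` and `q` parameters are as I understand Google's documentation, so treat them as assumptions and check the current reference. The key and engine ID are placeholders, and the sums ignore any free allowance.

```python
from urllib.parse import urlencode

# Endpoint paths are assumptions based on Google's public documentation.
STANDARD_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
SITE_RESTRICTED_ENDPOINT = "https://www.googleapis.com/customsearch/v1/siterestrict"

PRICE_PER_THOUSAND_USD = 5.0   # "$5 per thousand queries", per the post
STANDARD_DAILY_CAP = 10_000    # daily limit on the standard JSON API


def build_query_url(api_key: str, engine_id: str, query: str,
                    site_restricted: bool = False) -> str:
    """Build a search URL without sending it anywhere."""
    base = SITE_RESTRICTED_ENDPOINT if site_restricted else STANDARD_ENDPOINT
    params = {"key": api_key, "cx": engine_id, "q": query}
    return base + "?" + urlencode(params)


def daily_cost_usd(queries: int, site_restricted: bool = False) -> float:
    """Estimate one day's cost, enforcing the cap on the standard API only."""
    if not site_restricted and queries > STANDARD_DAILY_CAP:
        raise ValueError("standard JSON API is capped at 10,000 queries per day")
    return queries / 1000 * PRICE_PER_THOUSAND_USD


if __name__ == "__main__":
    # "YOUR_KEY" and "YOUR_ENGINE_ID" are placeholders, not real credentials.
    print(build_query_url("YOUR_KEY", "YOUR_ENGINE_ID", "open access",
                          site_restricted=True))
    print(daily_cost_usd(10_000))
    print(daily_cost_usd(50_000, site_restricted=True))
```

On these figures a site hitting the 10,000 cap pays $50 a day, and a site-restricted engine taking 50,000 queries pays $250 a day.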

I guess a use-case for this would be a huge corporation with a very heavily-used site, such as Boeing, which wants to offer its clients the quickest and most accurate way to search across all its technical reports, papers and manuals, where these are spread across no more than 10 different URLs. That use-case would likely need some guarantees from Google, though, on the spread and depth of the indexing.

4m Open Library books, full-text, deep search

14 Saturday Jul 2018

Posted by futurilla in Academic search, Spotted in the news

You can now ‘search inside’ all 4m Open Library books held at Archive.org, with your search seemingly constrained to just those books (and not the jumble that Archive.org also hosts). Nice results, with multi-snippets from deep inside the full-text of the books, plus phrase highlighting. This looks like excellent work, and it takes advantage of new tweaks by Archive.org’s search leader Giovanni Damiola.

A serious history researcher is still going to need to pound Archive.org itself and go through everything, but at first glance this seems to be a useful time-saver for those who only need to search the upper layers of the service.

The ultimate goal of the Open Library is “One Web page for every book ever published”. Think of it as one of those annoying university repositories where 95% of the full-text is not available yet, but will be one day… so “here’s a record page instead”. But in this case it’s for all books, and already has a substantial amount of full-text for free.

Birmingham Museums Trust images are CC0

13 Friday Jul 2018

Posted by futurilla in Spotted in the news

All public domain Birmingham Museums Trust images are now CC0. Birmingham UK, not USA. The Trust will still make a charge for use of files larger than 3MB at 300dpi. Currently the site is still offering only small, low-resolution Web pictures, but the Trust states they… “will introduce a new Digital Asset Management System in late 2019”, when presumably we’ll get new buttons to access large files. Until then, I’m guessing that one has to manually email them to ask for a larger CC0 version. The 3MB barrier is a nice model, but AI upscaling of images will probably ensure that it only lasts a few years.

Burne-Jones, sketch for Theseus and the Minotaur in the Labyrinth.

itty.bitty

07 Saturday Jul 2018

Posted by futurilla in Open Access publishing, Spotted in the news

itty.bitty, new from the design leader at Dropbox. Itty.bitty uses the URL to contain the text of a Web page. The page can have 2,000 bytes, or about 170-200 words, if you’re going to support legacy Web browsers such as Internet Explorer.

No hosting server is required, as the data sits after the # symbol. What comes after the # (the URL ‘fragment’) is meant to be page-position related, and as such it never gets sent to the server.
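Here is a rough Python sketch of the general idea: compress the page, base64 it, and hang it after the #. I have not checked that this byte-for-byte matches what the real itty.bitty site produces, so the `title/payload` layout, the LZMA settings and the function names here are all my own assumptions.

```python
import base64
import lzma
from urllib.parse import quote, unquote, urlsplit

BASE_URL = "https://itty.bitty.site/"
LEGACY_URL_LIMIT = 2000  # conservative length for older browsers, per the post


def make_link(title: str, html: str) -> str:
    """Pack a page into the URL fragment (assumed layout: #title/payload)."""
    packed = lzma.compress(html.encode("utf-8"), format=lzma.FORMAT_ALONE)
    payload = base64.b64encode(packed).decode("ascii")
    return f"{BASE_URL}#{quote(title, safe='')}/{payload}"


def read_link(url: str) -> tuple[str, str]:
    """Recover the title and page text from the fragment alone."""
    title, _, payload = urlsplit(url).fragment.partition("/")
    packed = base64.b64decode(payload)
    html = lzma.decompress(packed, format=lzma.FORMAT_ALONE).decode("utf-8")
    return unquote(title), html


def server_sees(url: str) -> str:
    """Return the URL without its fragment, which is all a server is sent."""
    return urlsplit(url)._replace(fragment="").geturl()


def fits_legacy_limit(url: str) -> bool:
    """Check the whole link against the conservative length limit."""
    return len(url) <= LEGACY_URL_LIMIT


if __name__ == "__main__":
    link = make_link("How it works", "<p>This link is the page.</p>")
    print(link)
    print(server_sees(link))
    print(read_link(link))
```

The point to notice is `server_sees`: whatever is packed into the fragment, the request that reaches the host is only the bare address.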

The base64 link code is not pretty…

But the same link/page displays as…

How it works

This link is the page.

Scripting and hyper-linking is enabled in such pages, so long as it all fits in the URL length. The code can’t do images, but you can do old-school ASCII-art.

The main drawback seems to be that you’re going to have to be 1000% sure that your text is exactly as you want it before you make the link… because there’s no after-post editing for errors or updating of dead hyperlinks in the page.

In which case you’d ideally consistently version and date-stamp the

">How it works

bit, as…

">How it works (v.0.1 | 14/07/2018)

…so that people and search-tools can discover later updated versions of the same content. Otherwise the itty.bitty system risks becoming an intertwingled mess of half-baked and old/broken stuff that you (and probably Google) won’t want in search results.
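A consistent stamp is easy enough to generate and read back by machine. Here is a minimal Python sketch, assuming the `Title (v.X.Y | DD/MM/YYYY)` pattern I used above. The function names are hypothetical, and nothing here is part of itty.bitty itself.

```python
import re
from datetime import date

# Matches titles such as "How it works (v.0.1 | 14/07/2018)".
STAMP = re.compile(
    r"^(?P<title>.*) \(v\.(?P<version>\d+(?:\.\d+)*) \| "
    r"(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})\)$"
)


def stamp_title(title: str, version: str, on: date) -> str:
    """Append a version and a day/month/year date to a page title."""
    return f"{title} (v.{version} | {on.strftime('%d/%m/%Y')})"


def parse_title(stamped: str):
    """Return (title, version tuple, date), or None if there is no stamp."""
    match = STAMP.match(stamped)
    if match is None:
        return None
    version = tuple(int(part) for part in match["version"].split("."))
    when = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return match["title"], version, when


def latest(stamped_titles):
    """Pick the newest stamped title by version, then by date."""
    parsed = [p for p in map(parse_title, stamped_titles) if p is not None]
    return max(parsed, key=lambda p: (p[1], p[2]), default=None)


if __name__ == "__main__":
    print(stamp_title("How it works", "0.1", date(2018, 7, 14)))
    print(latest([
        "How it works (v.0.1 | 14/07/2018)",
        "How it works (v.0.2 | 20/07/2018)",
        "How it works",  # unstamped, so ignored
    ]))
```

Comparing versions as tuples of integers, rather than as text, keeps v.0.10 sorting after v.0.2.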

I’m guessing that advanced Web browsers such as Brave will soon ‘add a feature’ in relation to this, by enabling much longer data-carrier code to be read from URLs. Perhaps also some simple automatic “…and can we find a later version of this itty.bitty.site?” query, done inside the browser. There would, however, also have to be some sort of dynamic ownership hash embedded in the page, to protect against impersonation of the page-author. Perhaps the system of authoring an ownership-hash and datestamp could be combined into a simple ‘one-click operation’ in a desktop authoring tool.

Anyway, it’s one example of the coming uncensorable Decentralized Web.

500px – Creative Commons close-down and a Getty-grab

01 Sunday Jul 2018

Posted by futurilla in JURN tips and tricks, Spotted in the news

Flickr-alternative 500px has announced it is set to close down the sharing of images under Creative Commons. The new owners have partnered with evil megacorp Getty and as a consequence are…

“disabling the ability for people to upload or download photos shared under Creative Commons licenses.”

So far as I can tell from tests, the CC options and search have not yet been disabled.

But it’s not that desperate in terms of effects on serious picture researchers — I mean, when did you last find a print-sized commercial-use CC picture at 500px, via Google Images? Never, in my experience. It’s probably because the 500px user-base tends strongly toward makers of naff me-too ‘stock’ and ‘tourist’ images, which are of no use to academics and historians (and of little use to discriminating stock-hunters, either). But the decision is annoying for creatives who have a 500px subscription. Which includes me, after the once-great Flickr was crashed and burned by Yahoo.

I think the way for active makers to get around the new block may be just to tag with the phrase “Creative Commons” in the keywords, and also add a ‘please freely use this image’ comment as the creator. But not to explicitly place the picture under a CC license (which it seems won’t even be an option, soon). Let’s hope the new owners of 500px are not so crass as to also go in and delete all their users’ “Creative Commons” keyword tags.

More importantly, for 500px users….

“If you’re a contributing photographer who has not opted out of distribution, your images may be selected for inclusion on Getty Images”.

I find that also applies to people who have not chosen to actively try to sell stock on the 500px site. Here’s how to prevent Getty from grabbing all your pictures, in the next day or so…

1. Go to “Settings”…

2. Find “Distribution”, tick the check-box and save.

Presumably the plan is that all the commercial-use CC 500px images show up for sale at Getty next week, and that the 500px users then have no way to pull them back and/or delete them?

WordPress at 15

29 Tuesday May 2018

Posted by futurilla in Spotted in the news

Wow, 15 years of WordPress, and still the best and most reliable and generous social-media company! It probably helps that the software itself is GPL and run by a Foundation, and that the .com side is still basically (as far as I can tell) run by the founder Matt Mullenweg. Thanks Matt!

On ResearchGate

22 Tuesday May 2018

Posted by futurilla in Academic search, Official and think-tank reports, Spotted in the news

What publishers can take away from the latest early career researcher research ($), a five-page “Industry Update” for the journal Learned Publishing, 28th April 2018…

“ResearchGate is unquestionably the scholarly elephant in the room, which despite being just 10 years old boasts 15 million research members and is still growing at a rate of knots. … publisher offerings can look monastic and parochial by comparison. […] It looks rather like the new scholarly world order.” […] “Much depends on whether ECRs [early-career-researchers] take their millennial beliefs in sharing, openness, and transparency into leadership positions. [and if] publishers [start] feeding ResearchGate rather than competing with it – [making it] a publishing Amazon”.

The Update is by the team doing an industry-supported three-year cohort study of search and similar practices. Their first two reports are Early Career Researchers: the harbingers of change? Year One 2016 and now also the Year Two 2017 report, both free and public at the same website. Apparently the cohort of 100+ is all science and social studies.

Also fairly new, and related, “ResearchGate and Academia.edu as networked socio-technical systems for scholarly communication: a literature review” (OA), in the Research in Learning Technology journal, 20th February 2018…

“a thorough understanding is still lacking of how these sites operate as networked socio-technical systems reshaping scholarly practices and academic identity. This article analyses 39 empirical studies published in peer-reviewed journals with a specific focus on ResearchGate and Academia.edu.”

Google Search currently suggests circa 72-million full-text PDFs at ResearchGate, although given the above Industry Update statement on ‘the 15m members’ we can probably assume some 10m of those PDFs are just CVs (which are nearly all excluded from JURN, by the way). Remove other fluff and I guess there might be circa 50m proper papers there. It would then be interesting to work out what “the uniques” are, by removing the papers freely available elsewhere in repositories and OA journals and suchlike. I’d very roughly guess that including ResearchGate PDFs in JURN may bring in some 5m to 8m papers not found elsewhere.

New book: Shadow Libraries

21 Monday May 2018

Posted by futurilla in How to improve academic search, Spotted in the news

New from MIT Press and under CC, Shadow Libraries: Access to Educational Materials in Global Higher Education (PDF). Also available in paperback via Amazon etc. Surveys the evolution of the trend that has today become Sci-Hub, Libgen.io etc.

India culls 4,305 dubious journals

20 Sunday May 2018

Posted by futurilla in Spotted in the news

Nature India, May 2018: “India culls 4,305 dubious journals from approved list”…

“The University Grants Commission (UGC), which funds and oversees higher-education in India, has removed 4,305 spurious journals from a list of some 30,000 publications used for weighing academic performance.”

The Delhi Declaration on Open Access recently stated “20,000+ journals being published from India” alone.

News from JURN

~ search tool for open access content
