Another month, another search engine for the well-thumbed corpus of academic articles in Computer Science. Semantic Scholar is a touch different though, as it’s been developed at the Allen Institute for Artificial Intelligence and it searches only 3 million open access papers. As such I guess that most Computer Science students may come to think of it as just a much more elegantly designed and somewhat faster equivalent of Microsoft Academic, minus the pesky records with no PDF links.
Semantic Scholar reportedly plans to expand to the neurosciences and biomedicine by 2016-18. And, of course, one should never underestimate the Microsoft tortoise/hare growth method (Allen is a Microsoft co-founder): what looks like a lackluster tortoise at first slowly builds and redefines, and re-builds and expands again over the years, until suddenly it’s out in front of the race. That process stalled with the reported ceasing of further development on Microsoft Academic, but it may be that Semantic Scholar is effectively Microsoft’s arm’s-length second try at that. Just my guess.
As with most such ventures, it seems to be cloaking the allegedly A.I.- and semantics-assisted development of something far more commercial and widely applicable: accurate automatic full-text detection (CORE could only get to around 27% with that on academic repositories, last I heard), then document structure evaluation, extraction, segmentation and re-formatting. Which is nice, if one only has to organise an interface for a very well-behaved corpus of Computer Science papers. Semantic Scholar certainly looks like it can do that, and elegantly too, though I’m not qualified to comment on its relevancy ranking or the alleged semantic aspects. But I suspect we’re still many decades from having an autobot that can tame the messy Wild West of open publishing in that manner.