Now on Archive.org in a handy open .torrent form: the Corpus of 45 million OA papers, 46GB in size and dated January 2019. It was gathered by Semantic Scholar from Computer Science, Neuroscience, and Biomedical. I get the impression that Semantic Scholar has picked up some bycatch from other fields, but the papers will be overwhelmingly in those three areas.

Archive.org has also recently placed online a whole bundle of similar bibliographic datasets from disparate sources, with torrents. This seems to be part of their FatCat project, which aims to ingest and preserve all available records and metadata from mainstream scholarly journal publishing. Open snapshots of the resulting combined (and presumably cleaned and aligned) FatCat mega-base are also available, the latest one dated 30th January 2019 and under CC0. It weighs in at a modest 80GB, so have a spare hard-drive ready.