Common Crawl has been updated and now stands at “3.2 billion Web pages and 260 TiB of uncompressed content”. In September they added a list of university domains to the crawl, and this time, for the first time, they have actively been trying to blacklist spam-network pages.
The Crawl also now includes 300m+ new URLs from a paid service called Mixnode, which looks like a very interesting on-demand custom-crawling and indexing service…
“Mixnode can breeze through thousands of URLs per second and download gigabytes of data per minute without a hitch.”
Presumably some of this comes from abstracting sections of the Common Crawl and then ‘filling in’ the rest?
Now all Mixnode needs is a half-decent ‘public search’ front-end for its crawls, and it’s ‘Build Your Own Search Engine’ time, without the limitations of a Google CSE.
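For anyone tempted to tinker now, one public building block already exists: Common Crawl’s URL index (the CDX API at index.commoncrawl.org), which a home-made search front-end could sit on top of. A minimal sketch in Python, assuming the `requests` library is installed and using a placeholder crawl ID (substitute a current one from the index page):

```python
import json
import requests

# Placeholder crawl ID -- pick a current one from https://index.commoncrawl.org/
CRAWL_ID = "CC-MAIN-2018-43"
API = f"https://index.commoncrawl.org/{CRAWL_ID}-index"


def lookup(url_pattern, limit=10):
    """Return up to `limit` index records matching a URL pattern."""
    resp = requests.get(
        API,
        params={"url": url_pattern, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # The index returns one JSON object per line (NDJSON).
    return [json.loads(line) for line in resp.text.splitlines() if line]


if __name__ == "__main__":
    for record in lookup("example.com/*"):
        # Each record points into a WARC archive (filename/offset/length),
        # which is what a DIY search engine would fetch and index.
        print(record.get("timestamp"), record.get("url"))
```

That only gets you URL lookups, of course; the ‘Build Your Own Search Engine’ step would mean pulling the referenced WARC records and feeding them into your own full-text index.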