Common Crawl has been updated and now stands at “3.2 billion Web pages and 260 TiB of uncompressed content”. In September they added a list of university domains to the crawl. This time, for the first time, they have also been actively trying to blacklist spam-network pages.

The Crawl now also includes 300m+ new URLs from a paid service called Mixnode, which looks like a very interesting on-demand custom-crawling and indexing service…

“Mixnode can breeze through thousands of URLs per second and download gigabytes of data per minute without a hitch.”

Presumably some of this comes via abstracting sections of the Common Crawl, then ‘filling in’ the rest?

Now all Mixnode needs is a half-decent ‘public search’ front-end for a Mixnode crawl, and it’s ‘Build Your Own Search Engine’ time — without the limitations of a Google CSE.
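In the meantime, the existing Common Crawl data can already be queried directly. Here is a minimal sketch, assuming the public CDX index API at index.commoncrawl.org; the particular monthly crawl ID (CC-MAIN-2018-43) and the data host are assumptions you would swap for whatever crawl you want. It looks a URL up in the index, then pulls the matching WARC record with an HTTP range request:

```python
# Minimal sketch: query the Common Crawl CDX index for a URL, then fetch
# the matching WARC record. Crawl ID and data host are assumptions.
import gzip
import io
import json

import requests

CRAWL_ID = "CC-MAIN-2018-43"  # hypothetical choice of monthly crawl
INDEX_ENDPOINT = f"https://index.commoncrawl.org/{CRAWL_ID}-index"


def lookup(url):
    """Return index records (one JSON object per line) for a URL."""
    resp = requests.get(INDEX_ENDPOINT, params={"url": url, "output": "json"})
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_warc_record(record):
    """Fetch one gzipped WARC record via an HTTP range request."""
    offset = int(record["offset"])
    length = int(record["length"])
    byte_range = f"bytes={offset}-{offset + length - 1}"
    resp = requests.get(
        "https://data.commoncrawl.org/" + record["filename"],
        headers={"Range": byte_range},
    )
    resp.raise_for_status()
    # Each WARC record is an independently gzipped member, so the
    # downloaded slice can be decompressed on its own.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()


if __name__ == "__main__":
    records = lookup("example.com/")
    if records:
        print(fetch_warc_record(records[0])[:500])
```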