{"id":20361,"date":"2017-11-29T23:27:12","date_gmt":"2017-11-29T22:27:12","guid":{"rendered":"https:\/\/jurnsearch.wordpress.com\/?p=20361"},"modified":"2017-11-29T23:27:12","modified_gmt":"2017-11-29T22:27:12","slug":"mixnode","status":"publish","type":"post","link":"https:\/\/jurn.link\/jurnsearch\/index.php\/2017\/11\/29\/mixnode\/","title":{"rendered":"Mixnode"},"content":{"rendered":"<p><a href=\"https:\/\/commoncrawl.org\/2017\/11\/november-2017-crawl-archive-now-available\/\">Common Crawl has updated<\/a> and is now at &#8220;3.2 billion Web pages and 260 TiB of uncompressed content&#8221;. In September they added a list of university domains to the crawl. This time, for the first time, they&#8217;ve actively been trying to blacklist spam-network pages.<\/p>\n<p>The Crawl is also now including 300m+ new URLs from a paid service called <a href=\"https:\/\/www.mixnode.com\/\">Mixnode<\/a>, which looks like a very interesting on-demand custom-crawling and indexing service&#8230;<\/p>\n<blockquote><p>&#8220;Mixnode can breeze through thousands of URLs per second and download gigabytes of data per minute without a hitch.&#8221;<\/p><\/blockquote>\n<p>Presumably some of this comes via abstracting sections of the Common Crawl, then &#8216;filling in&#8217; the rest?<\/p>\n<p>Now all Mixnode needs is a half-decent &#8216;public search&#8217; front-end for a Mixnode crawl, and it&#8217;s &#8216;Build Your Own Search Engine&#8217; time &mdash; without the limitations of a Google CSE.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Common Crawl has updated and is now at &#8220;3.2 billion Web pages and 260 TiB of uncompressed content&#8221;. In September &hellip;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/index.php\/2017\/11\/29\/mixnode\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,16],"tags":[],"class_list":["post-20361","post","type-post","status-publish","format-standard","hentry","category-academic-search","category-spotted-in-the-news"],"_links":{"self":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/20361","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/comments?post=20361"}],"version-history":[{"count":0,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/20361\/revisions"}],"wp:attachment":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/media?parent=20361"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/categories?post=20361"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/tags?post=20361"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}