{"id":19163,"date":"2017-04-26T07:25:10","date_gmt":"2017-04-26T06:25:10","guid":{"rendered":"https:\/\/jurnsearch.wordpress.com\/?p=19163"},"modified":"2017-04-26T07:25:10","modified_gmt":"2017-04-26T06:25:10","slug":"internet-archive-starts-to-ignore-robots-txt-files","status":"publish","type":"post","link":"https:\/\/jurn.link\/jurnsearch\/index.php\/2017\/04\/26\/internet-archive-starts-to-ignore-robots-txt-files\/","title":{"rendered":"Internet Archive starts to ignore robots.txt files"},"content":{"rendered":"<p>The Internet Archive (and its Wayback Machine) has recently announced it has stopped honouring robots.txt files for some sites. A robots.txt is a simple text file which tells visiting crawlers and harvesting bots which parts of a site the owners don&#8217;t want accessed, copied and (potentially irrevocably) made public somewhere else without their permission.<\/p>\n<p>The Internet Archive is currently&#8230; &#8220;ignoring [robots.txt warnings at] U.S. government and military web sites&#8221;, and states that in future&#8230; &#8220;We are now looking to do this more broadly.&#8221;<\/p>\n<p>This would seem to have a number of implications for repositories and journals, especially in terms of retractions, &#8216;heavy harvesting&#8217; of large numbers of large files, and the practical implementation of the emerging legal concept of &#8216;the right to be forgotten&#8217;.<\/p>\n<p>To anticipate this impending policy change at the Internet Archive and to block its crawlers, you reportedly need to set up a way to &#8220;limit access by IP addresses&#8221; from the IA, and\/or configure your site to block visiting clients named &#8220;ia_archiver&#8221;.<\/p>\n<p>If you can&#8217;t do that (at first glance it looks a lot more complex than simply uploading a plain robots.txt file) then note that they say they will&#8230; &#8220;respond to removal requests sent to info@archive.org&#8221;.
The latter option may be of special interest to hosted wordpress.com blogs and similar sites, which have no means of blocking the IA&#8217;s crawlers.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Internet Archive (and its Wayback Machine) has recently announced it has stopped honouring robots.txt files for some sites. A &hellip;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/index.php\/2017\/04\/26\/internet-archive-starts-to-ignore-robots-txt-files\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8,16],"tags":[],"class_list":["post-19163","post","type-post","status-publish","format-standard","hentry","category-jurn-tips-and-tricks","category-spotted-in-the-news"],"_links":{"self":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/19163","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/comments?post=19163"}],"version-history":[{"count":0,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/19163\/revisions"}],"wp:attachment":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/media?parent=19163"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/categories?post=19163"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/tags?post=19163"}],"curies":[{"name":"wp","href":"https:\/\
/api.w.org\/{rel}","templated":true}]}}