Internet Archive starts to ignore robots.txt files

The Internet Archive (and its Wayback Machine) has recently announced that it has stopped honouring robots.txt files for some sites. A robots.txt is a simple text file which tells visiting crawlers and harvesting bots which parts of a site the owners don’t want accessed, copied and (potentially irrevocably) made public somewhere else without their permission.

The Internet Archive is currently “ignoring [robots.txt warnings at] U.S. government and military web sites”, and says of its future plans: “We are now looking to do this more broadly.”

This would seem to have a number of implications for repositories and journals, especially in terms of retractions, ‘heavy harvesting’ of large numbers of large files, and the practical implementation of the emerging legal concept of ‘the right to be forgotten’.

To anticipate this impending policy change at the Internet Archive and to block its crawlers, you reportedly need to set up a way to “limit access by IP addresses” from the IA, and/or configure your site to block visiting clients named “ia_archiver”.
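As a rough illustration of the second approach, here is a minimal sketch of blocking by client name in a Python (WSGI) web application. It assumes the crawler identifies itself with “ia_archiver” somewhere in its User-Agent header, as reported above. The names `BLOCKED_AGENT_TOKENS`, `is_blocked_agent` and `block_archiver` are hypothetical, invented for this example. A crawler that sends a different User-Agent would not be caught, and blocking by IP address is not shown because no address ranges are given here.

```python
# Hypothetical sketch: refuse requests whose User-Agent contains a blocked token.
# Assumption: the crawler announces itself as "ia_archiver" (as reported above).
BLOCKED_AGENT_TOKENS = ("ia_archiver",)


def is_blocked_agent(user_agent):
    """Return True if the User-Agent string contains a blocked token."""
    ua = (user_agent or "").lower()  # a missing header is treated as empty
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def block_archiver(app):
    """Wrap a WSGI application so that blocked clients receive a 403 response."""
    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT")):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Access denied.\n"]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)
    return middleware
```

To use it, wrap your existing WSGI application object, for example `application = block_archiver(application)`. Most web servers offer an equivalent rule in their own configuration, which is likely to be the simpler route where you control the server.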

If you can’t do that (at first glance it looks a lot more complex than simply uploading a plain robots.txt file), then note that the Archive says it will “respond to removal requests sent to info@archive.org”. This option may be of special interest to hosted wordpress.com blogs and similar sites, which have no means of blocking the IA’s crawlers.

News from JURN

~ search tool for open access content
