Last modified: 2011-03-13 18:04:27 UTC
Somewhat related to bug 4937: We have received several complaints about http://www.archive.org/web/web.php storing specific offensive revisions of Wikipedia pages, especially user and user_talk pages (mainly privacy issues because of the publication of personal data). For those few readers of this bug who don't know: the Wayback Machine stores years-old data and may keep it even after the original page is gone. So revisions that were already deleted on Wikipedia are still stored by the Wayback crawler (in contrast to, e.g., Google's cache, which simply overwrites old data).

According to http://www.sims.berkeley.edu:8000/research/conferences/aps/removal-policy.html, our users normally have no easy and suitable way to request removal of "their" data, because they have to prove their identity and their ownership of a Wikipedia account (possible, but complicated). This is also a more general problem, because most users are not aware of the Wayback Machine at all. A common argument in the discussion about this topic is that there is no need for such external storage, because we have our own page history, which is widely distributed ...

To exclude the Internet Archive's crawler (and remove old documents there), robots.txt should say:

User-agent: ia_archiver
Disallow: /

I don't see any disadvantage in adding this, at least for NS:2 and NS:3, which nearly all requests referred to, afaik.
It should be:

User-agent: ia_archiver
Disallow: /wiki/User
Disallow: /wiki/Benutzer

etc. Marco
We don't have our own history. We need the Internet Archive to protect us against deletionist admins who would erase the early history of Wikipedia without a second thought, on the basis that it's unencyclopedic or not part of our mission or whatever.
If that is generally regarded as unwanted: what would then be a feasible method for removing specific user pages, and probably some others too, from the Archive? (Meaning pages that were already deleted for good reason on Wikipedia, and only on particular request, of course.)
http://www.archive.org/about/faqs.php#The_Wayback_Machine
http://web.archive.org/web/20050305142910/http://www.sims.berkeley.edu/research/conferences/aps/removal-policy.html

The recommended way appears to be to add specific pages to robots.txt... :)
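For illustration, such per-page entries might look like the following (the titles here are made-up examples, not pages from any actual request; robots.txt Disallow rules match by path prefix):

User-agent: ia_archiver
Disallow: /wiki/User:Example
Disallow: /wiki/User_talk:Example
Disallow: /wiki/Benutzer:Beispiel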
Right, this is the option to use if we do not want to exclude NS:2 and NS:3 in general. But it would mean that robots.txt has to be changed quite often, and that the file would become rather long. In the long run, such requests would likely have to be handled on a daily basis; would that be feasible? In addition, if this method becomes better known at some point, robots.txt could serve as a pointer for people who want to exploit such issues and republish exactly those pages elsewhere ... odd.

Hmm, another option could be to authorise the Foundation/the Office to handle such issues on individual request, presumably by email to archive.org (perhaps bundling the specific pages from all the different projects into one mail, if sending out many mails for single requests would keep them too busy). Note: we recently had a case where a user tried to get his pages removed through a very complicated and long correspondence with archive.org. In the end he failed because he was not able (from archive.org's point of view, afaik) to prove that he really owns the Wikipedia account, and also because it was hard to communicate about non-English matters.

So if someone (or a position) could be named as a contact here (either "please poke a dev to add it" or "please contact X, WMF"), I suggest closing this bug.
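As a rough illustration of the maintenance burden discussed above, the per-page entries could be generated in a batch from a list of deleted titles rather than edited by hand. This is only a sketch under stated assumptions: the input file name, the helper function, and the whole workflow are hypothetical, not an existing tool.

import urllib.parse

def robots_entries(titles, user_agent="ia_archiver"):
    """Build robots.txt lines that disallow each given page title for one crawler."""
    lines = ["User-agent: " + user_agent]
    for title in titles:
        # MediaWiki article paths use underscores instead of spaces;
        # percent-encode the rest, but leave ':' and '/' readable.
        path = "/wiki/" + urllib.parse.quote(title.replace(" ", "_"), safe=":/")
        lines.append("Disallow: " + path)
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical input: one deleted page title per line in deleted_titles.txt.
    with open("deleted_titles.txt", encoding="utf-8") as f:
        titles = [line.strip() for line in f if line.strip()]
    print(robots_entries(titles))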
Component: Site requests
robots.txt can now be edited on-wiki by editing MediaWiki:Robots.txt => closing this bug
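For illustration, entries like the following could be placed on that page. This assumes, as the closing comment implies, that the on-wiki text ends up in the served robots.txt; the lines merely restate the namespace-level suggestion from Marco's comment and are an example, not a recorded change:

# Keep the Internet Archive crawler out of user and user talk pages
User-agent: ia_archiver
Disallow: /wiki/User
Disallow: /wiki/Benutzer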