Last modified: 2011-03-13 18:04:27 UTC
Possibly related to bug 4937:
We've received several complaints about http://www.archive.org/web/web.php
accidentally storing specific offensive revisions of Wikipedia pages, especially
of user and user_talk pages (mainly privacy issues arising from the publication of
personal data). The Wayback Machine stores years-old data and may keep it
even after the original page is gone, for those few "bug readers" who don't know ...
So revisions that were already deleted on Wikipedia are still stored by the Wayback
crawler (unlike, e.g., Google's cache, which simply updates/overwrites old data).
Our users normally don't have an easy and suitable way to request the removal
of "their" data, because they have to prove both their own identity and their ownership
of a Wikipedia account (possible, but complicated).
This is also a more general problem, because most users are not aware of the
Wayback Machine at all. A common argument in the discussion of this topic is
that there is no need for such external storage, because we've got our own page
history, which is widely distributed ...
To exclude the Internet Archive's crawler (and remove old documents there)
robots.txt should say:
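A rule along these lines would do it; this is a sketch, assuming Wikipedia's usual /wiki/ URL layout, with the User: and User_talk: prefixes standing in for NS:2 and NS:3 (the Internet Archive's crawler identifies itself as ia_archiver):

```
User-agent: ia_archiver
Disallow: /wiki/User:
Disallow: /wiki/User_talk:
```

Per the Archive's stated policy, adding such a rule also retroactively excludes already-archived documents under those paths from the Wayback Machine.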
I don't see any disadvantages in adding this, at least for NS:2 and NS:3, which
nearly all such requests refer to, afaik.
It should be
We don't have our own history. We need the Internet Archive to protect us
against deletionist admins who would erase the early history of Wikipedia
without a second thought, on the basis that it's unencyclopedic or not part of
our mission or whatever.
If that is generally regarded as unwanted: what would be a feasible
method for removing specific user pages (and probably some others too)
from the Archive – pages that were already deleted for good reason on
Wikipedia, and only on specific request, of course?
The recommended way appears to be to add specific pages to robots.txt... :)
Right, this is the option to use if we do not want to exclude ns:2 and ns:3 in general.
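For individual pages, the entries would presumably look something like this (the page names here are hypothetical, purely for illustration; one Disallow line per affected page):

```
User-agent: ia_archiver
Disallow: /wiki/User:ExampleUser
Disallow: /wiki/User_talk:ExampleUser
```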
But this would mean that robots.txt has to be changed pretty often, and also
that the file would become rather long. Thinking in the long run, it is likely that
such requests would have to be handled on a daily basis. Would that be feasible?
In addition, if this method becomes better known at some point, robots.txt could
serve as a pointer for anyone who wants to seize on such issues and republish
exactly those pages elsewhere ... odd.
Hmm, another option could be to authorise the Foundation/the Office to handle
such issues on individual request, presumably via email to archive.org (perhaps
with a batch of specific pages from all the different projects in one mail, if sending
out many mails for single requests would keep them too busy).
Note: We had a case recently where a user tried to get his pages removed through a
very long and complicated correspondence with archive.org. He finally failed because
he was unable (from archive.org's point of view, afaik) to prove that he was really the
owner of the Wikipedia account, and also because it was hard to communicate about
non-English content.
So if someone (or a role) could be named as the contact here
(either "please poke a dev to add it" or "please contact X, WMF"),
I suggest closing this bug.
Component: Site requests
robots.txt can now be edited on-wiki by editing MediaWiki:Robots.txt => closing this bug