Last modified: 2011-03-13 18:04:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T7582, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 5582 - Disallow ia_archiver on user and user_talk pages (robots.txt)


Summary:	Disallow ia_archiver on user and user_talk pages (robots.txt)

Status:	RESOLVED WONTFIX

Product:	Wikimedia
Classification:	Unclassified
Component:	Site requests (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Lowest normal with 8 votes (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	robots.txt

Reported:	2006-04-15 13:50 UTC by bdk
Modified:	2011-03-13 18:04 UTC (History)
CC List:	2 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description bdk 2006-04-15 13:50:56 UTC

Somewhat related to bug 4937:
We've received several complaints about http://www.archive.org/web/web.php
accidentally storing specific offensive revisions of Wikipedia's pages, especially 
of user and user_talk pages (mainly privacy issues caused by the publication of 
personal data). The Wayback Machine stores years-old data and may keep it 
even if the original page is gone, for those few "bug readers" who don't know ...

So revisions that were already deleted on Wikipedia are still stored by the Wayback
crawler (in contrast to, e.g., Google's cache, which simply updates/overwrites old data).

According to 
http://www.sims.berkeley.edu:8000/research/conferences/aps/removal-policy.html
our users normally have no easy and suitable way to request the removal 
of "their" data, because they have to prove both their own identity and their ownership 
of a Wikipedia account (possible but complicated).

This is also a more general problem, because most users are not aware of the 
Wayback Machine at all. A common argument in the discussion about this topic is 
that there is no need for such external storage because we've got our own history, 
which is widely distributed ...

To exclude the Internet Archive's crawler (and remove old documents there) 
robots.txt should say:

 User-agent: ia_archiver
 Disallow: /

I don't see any disadvantages in adding this, at least for NS:2 and NS:3 (user and 
user_talk), which nearly all requests referred to, afaik.

Comment 1 Marco 2006-04-15 15:05:50 UTC

It should be
User-agent: ia_archiver
Disallow: /wiki/User
Disallow: /wiki/Benutzer
etc.

Marco
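
As a rough illustration of how the per-namespace rules from comment 1 would be interpreted, here is a minimal sketch using Python's standard urllib.robotparser. The example URLs and the helper name is_allowed are assumptions made for illustration; they do not come from the bug report. Note that robots.txt rules are prefix matches, so "Disallow: /wiki/User" also covers /wiki/User_talk:... and, as a side effect, ordinary articles whose titles begin with "User".

```python
from urllib.robotparser import RobotFileParser

# Rules as proposed in comment 1 (the "etc." for other languages is left out).
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /wiki/User
Disallow: /wiki/Benutzer
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under ROBOTS_TXT.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Example URLs are assumptions chosen to exercise the rules.
    samples = [
        ("ia_archiver", "https://en.wikipedia.org/wiki/User:Example"),
        ("ia_archiver", "https://en.wikipedia.org/wiki/User_talk:Example"),
        ("ia_archiver", "https://de.wikipedia.org/wiki/Benutzer:Example"),
        ("ia_archiver", "https://en.wikipedia.org/wiki/Main_Page"),
        ("ia_archiver", "https://en.wikipedia.org/wiki/User_interface"),
        ("Googlebot", "https://en.wikipedia.org/wiki/User:Example"),
    ]
    for agent, url in samples:
        print(agent, url, is_allowed(agent, url))
```

Using a prefix that ends with the namespace colon (for example /wiki/User:) would avoid catching unrelated article titles, at the cost of needing one line per namespace name and language.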

Comment 2 Tim Starling 2007-01-14 15:23:46 UTC

We don't have our own history. We need the Internet Archive to protect us
against deletionist admins who would erase the early history of Wikipedia
without a second thought, on the basis that it's unencyclopedic or not part of
our mission or whatever.

Comment 3 bdk 2007-04-03 19:04:54 UTC

If that is generally regarded as unwanted: what would be a feasible 
method for removing specific user pages, and probably some others too,
from the Archive then? I mean pages that were already deleted for good 
reason on Wikipedia, and on particular request, of course.

Comment 4 Brion Vibber 2007-04-03 19:30:07 UTC

http://www.archive.org/about/faqs.php#The_Wayback_Machine
http://web.archive.org/web/20050305142910/http://www.sims.berkeley.edu/research/conferences/aps/removal-policy.html

The recommended way appears to be to add specific pages to robots.txt... :)
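
The per-page approach mentioned above could be sketched as follows: a small Python function that turns a list of page titles into a robots.txt stanza for one crawler. The function name build_stanza, the /wiki/ path prefix, and the example titles are assumptions for illustration; the thread does not specify how such entries would be generated.

```python
from urllib.parse import quote


def build_stanza(user_agent, titles, prefix="/wiki/"):
    """Build a robots.txt stanza that disallows the given page titles.

    Hypothetical helper: spaces become underscores (as in wiki URLs) and
    other unsafe characters are percent-encoded; ':' and '/' are kept.
    """
    lines = [f"User-agent: {user_agent}"]
    for title in titles:
        path = prefix + quote(title.replace(" ", "_"), safe=":/")
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example titles are made up for this sketch.
    print(build_stanza("ia_archiver", ["User:Example", "User talk:Example"]))
```

Because robots.txt rules match by prefix, an entry such as /wiki/User:Example would also cover subpages and any other title that starts with the same characters.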

Comment 5 bdk 2007-04-03 20:24:04 UTC

Right, this is the option to use if we do not want to exclude ns:2 and ns:3 in general.

But this would mean that you have to change robots.txt pretty often, and also 
that this file would become rather long. In the long run, it is likely that 
you would have to handle such requests on a daily basis. Would that be feasible?
In addition, if this method becomes better known at some point, robots.txt could serve 
as a pointer for others who want to seize on such issues and publish exactly those 
pages elsewhere ... odd.

Hmm, another option could be to authorise the Foundation/the Office to handle 
such issues on individual request, presumably by email to archive.org (perhaps 
with a batch of specific pages for all the different projects in one mail, if sending 
many mails for single requests would keep them too busy). 

Note: We recently had a case where a user tried to get his pages removed through a 
very long and complicated correspondence with archive.org. In the end he failed because 
he was unable (from archive.org's view, afaik) to prove that he really is the person behind 
the Wikipedia account, and also because it was hard to communicate about non-English 
matters. 

So if someone (or a position) could be named as a contact here 
(either "please poke a dev to add it" or "please contact X, WMF"), 
I suggest closing this bug.

Comment 6 Siebrand Mazeland 2008-08-13 12:01:35 UTC

Component: Site requests

Comment 7 JeLuF 2008-09-13 00:07:54 UTC

robots.txt can now be edited on-wiki by editing Mediawiki:robots.txt => closing this bug
