Last modified: 2014-02-16 05:55:46 UTC
A username consisting entirely of spaces made its way into the German Wikipedia dump file. The article it happened on is at http://de.wikipedia.org/w/index.php?title=Negativ-Positiv_Verfahren&action=history . Since the username field is not marked as space-preserving, Parse::MediaWikiDump completely ignored its contents in this case. I have a feeling that a username of all spaces is not supposed to be allowed to exist. Tyler
Hello! If you go to http://de.wikipedia.org/w/index.php?title=Negativ-Positiv_Verfahren&action=history and click on the "space" link, you come to http://de.wikipedia.org/wiki/Benutzer_Diskussion:%C2%A0 and from there to http://de.wikipedia.org/wiki/Spezial:Contributions/%C2%A0 (no e-mail specified, or e-mails from other users disabled). The problem has been known since August; see http://de.wikipedia.org/wiki/Benutzer_Diskussion:%C2%A0 . The user name contains the Unicode character 'NO-BREAK SPACE' (U+00A0), see http://www.fileformat.info/info/unicode/char/00a0/index.htm : HTML entity (decimal) &#160;, (hex) &#xa0;, (named) &nbsp;; UTF-8 (hex) 0xC2 0xA0 (c2a0); percent-encoded %c2%a0 or %C2%A0. http://en.wikipedia.org/wiki/User:%C2%A0 is already known from http://bugzilla.wikimedia.org/show_bug.cgi?id=1524#c9 . Changing the name would be an administrative task, either at WP:DE or, better, on all projects. I do not know the policy on this. Please clarify this at the local wiki, via a mailing list such as [Wikide-l], [Wikitech-l] etc., or via IRC at irc://irc.freenode.net/mediawiki . Marking this bug as a duplicate of bug 1524: usernames should use unicode whitelist. http://fr.wikipedia.org/wiki/%C2%A0 is mentioned at bug 2173 comment 3 (bug 2173: Fatal error when removing an article with an whitespace title from the watchlist). Best regards, reinhardt [[user:gangleri]] *** This bug has been marked as a duplicate of 1524 ***
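The encodings of U+00A0 listed in the comment above can be cross-checked with a short sketch. It uses only the Python standard library; the variable names are illustrative and not taken from MediaWiki or Parse::MediaWikiDump.

```python
import html
from urllib.parse import quote, unquote

# The character reported in the user name: NO-BREAK SPACE.
NBSP = "\u00a0"

# UTF-8 encoding of U+00A0 is the two bytes 0xC2 0xA0.
utf8_bytes = NBSP.encode("utf-8")

# Percent-encoding of those bytes, as seen in the wiki URLs.
percent = quote(NBSP)
from_url = unquote("%c2%a0")  # lower-case hex digits decode the same way

# The decimal, hex, and named HTML entities all decode to the same character.
entities = [html.unescape(e) for e in ("&#160;", "&#xa0;", "&nbsp;")]

print(utf8_bytes.hex(), percent, from_url == NBSP, all(e == NBSP for e in entities))
```

All of these forms name the same single character, which is why the user page URLs end in %C2%A0.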
This isn't a duplicate of bug 1524. That bug deals with having a whitelist for registered usernames, but this particular username also happens to break the XML schema.
Thanks, Ævar! I did not read the second paragraph with the attention it required. Please look at what happens at http://en.wikipedia.org/wiki/User:%C2%A0 and http://fr.wikipedia.org/wiki/%C2%A0 . Please change the summary to reflect the new, major problem. Thanks in advance!
I don't understand. Does this really break dumps?
Also wondering. How exactly can one reproduce that it "breaks dumps"?
If the XML schema indicates that a data type is not white-space preserving, then white space is not significant and there is no difference between " ", "  ", "   ", "\t\n\n\n\t\t\t\t\t\t\t\t\t \n\n\n", etc. If a user name exists in which white space is significant, it becomes impossible to transmit it using a non-space-preserving data type. Thus it is not actually possible to retrieve such user names correctly, and this is rather broken.
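The point about non-preserved white space can be illustrated with a minimal sketch. The helper `xml_collapse` is hypothetical; it mimics the XML Schema `whiteSpace="collapse"` rule (tab, line feed, and carriage return become spaces, runs of spaces are squeezed, and leading and trailing spaces are dropped). It is not the code used by MediaWiki or Parse::MediaWikiDump, whose actual trimming behaviour is not shown in this report.

```python
import re

def xml_collapse(value: str) -> str:
    """Mimic xs:whiteSpace="collapse" (hypothetical helper for illustration)."""
    # Step 1: replace tab, line feed and carriage return with a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    # Step 2: squeeze runs of spaces into a single space.
    squeezed = re.sub(r" +", " ", replaced)
    # Step 3: drop leading and trailing spaces.
    return squeezed.strip(" ")

# Different white-space-only user names, as in the comment above.
samples = [" ", "  ", "   ", "\t\n\n\n\t\t\t\t\t\t\t\t\t \n\n\n"]
collapsed = [xml_collapse(s) for s in samples]
print(collapsed)  # every sample becomes the same empty string

# An ordinary name survives, minus its surrounding white space.
print(repr(xml_collapse("  Tyler \n")))
```

After collapsing, all four samples are indistinguishable, so a consumer cannot tell which user name was meant. Note that this rule covers only those four characters; a consumer that trims with a broader definition of white space (for example Python's argument-less `str.strip()`) would also erase a name consisting of U+00A0.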