Last modified: 2014-10-30 19:44:26 UTC
Related bugs: #3969 Unicode tracking; #14600, #5732; #2593, #1524 (regarding usernames)

A bug in the pywikipedia framework [1] showed up when editing interwikis. This caused bot wars [2], which have been fixed by removing the U+200B ZERO WIDTH SPACE from the end of the page title where the problems happened [3].

Should this character be allowed in page titles? And, more specifically, at the end of a page title?

Characters from the range U+2000-U+200A are already treated as spaces (and replaced by underscores). Since Unicode 4.0, U+200B is no longer considered whitespace by the Unicode Consortium.

To cite Brion Vibber in #14600:

> They're not technically illegal, but perhaps should be excluded as they
> wouldn't be useful.

and in #1524:

> *Invalid* characters (those that are illegal in XML or don't reliably cut and
> paste) need to be outright blocked in titles.

Although U+200B ZERO WIDTH SPACE seems to cut-and-paste on Windows, it's not something I'd call 'reliable': when selecting characters from the left and moving to the right, it's easy enough not to select the U+200B ZERO WIDTH SPACE at the end of the page title. As such, I think it's reasonable not to allow the character, or to replace it with an underscore.

[1] https://sourceforge.net/tracker/?func=detail&atid=603138&aid=3182761&group_id=93107
[2] http://en.wikipedia.org/w/index.php?title=Podolsk&action=history
[3] http://bo.wikipedia.org/w/index.php?title=%E0%BD%94%E0%BD%BC%E0%BC%8B%E0%BD%91%E0%BD%BC%E0%BD%A3%E0%BC%8B%E0%BD%A6%E0%BD%B2%E0%BD%82&action=history
To clarify: the pywikipedia bug was caused by calling .strip() on the page title. With Unicode data older than 4.0 (python < 2.7), this strips the U+200B character; with Unicode 4.0 or later (python >= 2.7), it does *not* strip the U+200B character.
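For anyone who wants to check the .strip() behavior locally, here is a minimal sketch, assuming a Python 3 interpreter (which follows the newer Unicode classification). The title string is made up for illustration and is not the actual affected page title.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Hypothetical title with a trailing zero-width space (illustration only).
title = "Podolsk" + ZWSP

# U+200B is a format character (category Cf), not a space separator (Zs),
# so str.isspace() is False for it and a bare strip() leaves it in place.
category = unicodedata.category(ZWSP)
stripped = title.strip()

# Passing the character explicitly removes it from both ends.
explicit = title.strip(ZWSP)

print(category, repr(stripped), repr(explicit))
```

So code that relies on a bare .strip() to clean up titles needs to name U+200B explicitly if it wants the old behavior.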
I don't _think_ it should be legit to have this char at the beginning/end of a title, as zero-width space is meant to be used as a separator to disable ligatures etc., and only makes sense within a span of non-whitespace characters. In the middle of words, it may actually be required for some languages.

Correct behavior is _probably_ to strip this char from the beginning/end during title normalization, while preserving it in the middle of words. Once that's done, a run of cleanupTitle & co should pretty transparently correct existing titles.

I would recommend double-checking the definitions and actual usage to make sure these assumptions are correct before changing rules, however.
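The proposed rule could be sketched roughly as follows. This is not MediaWiki's actual normalization code: normalize_title is a hypothetical helper, and the set of space-like characters (ASCII space, underscore, U+2000-U+200A) is an assumption taken from the description earlier in this report.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Assumption: these are the characters treated as spaces in titles.
_SPACE_RUN = re.compile("[ _\u2000-\u200a]+")


def normalize_title(title: str) -> str:
    """Hypothetical normalizer: trim U+200B at the ends, keep it inside."""
    # Collapse each run of space-like characters into one underscore.
    title = _SPACE_RUN.sub("_", title)
    # Trim underscores and zero-width spaces from both ends only;
    # occurrences in the middle of the title are left untouched.
    return title.strip("_" + ZWSP)


print(repr(normalize_title(ZWSP + "Foo" + ZWSP + "Bar" + ZWSP)))
```

Trimming underscores and U+200B in the same strip() call also handles mixed trailing sequences such as "Foo_" followed by U+200B.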
A user reported the existence of these two pages on Portuguese Wikipedia:
https://pt.wikipedia.org/wiki/Coming_Out_of_the_Dark
https://pt.wikipedia.org/wiki/Coming_Out_%E2%80%8B%E2%80%8Bof_the_Dark
which appear on lists such as
https://pt.wikipedia.org/wiki/Special:PrefixIndex/Coming_Out?uselang=en
as if they were two identically named pages.

Fortunately, this seems to be the only title where this character was used (at least on ptwiki):
http://tools.wmflabs.org/addshore/grep/?pattern=%E2%80%8B%E2%80%8B&lang=pt&wiki=wiki&ns=0
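A check for such look-alike titles on other wikis might look like the sketch below. It assumes the titles are already available as a plain list of strings (for example from a dump); find_lookalike_titles is a hypothetical helper and not part of any existing tool.

```python
from collections import defaultdict

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def find_lookalike_titles(titles):
    """Group titles that become identical once U+200B is removed."""
    groups = defaultdict(list)
    for title in titles:
        groups[title.replace(ZWSP, "")].append(title)
    # Keep only the groups where two or more distinct titles collide.
    return {key: group for key, group in groups.items() if len(group) > 1}


titles = [
    "Coming_Out_of_the_Dark",
    "Coming_Out_" + ZWSP * 2 + "of_the_Dark",
    "Podolsk",
]
for visible, group in find_lookalike_titles(titles).items():
    print(visible, [ascii(t) for t in group])
```

Printing with ascii() makes the otherwise invisible characters show up as \u200b escapes.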