Last modified: 2014-10-30 19:44:26 UTC
Related bugs: #3969 Unicode tracking; #14600, #5732; #2593, #1524 (regarding usernames)
A bug in the pywikipedia framework showed up when editing interwikis. This caused bot wars, which have been fixed by removing the U+200B ZERO WIDTH SPACE from the end of the page title where the problems happened.
Should this character be allowed in page titles? And, more specifically, at the end of a page title?
Characters from the range U+2000-U+200A are already treated as spaces (and replaced by underscores). Since Unicode 4.0, U+200B is no longer considered whitespace by the Unicode Consortium.
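The difference in classification can be checked against the Unicode database bundled with Python. This is only a minimal sketch, assuming a Python 3 interpreter (whose `unicodedata` module ships Unicode data well past 4.0); the variable names are illustrative.

```python
import unicodedata

# U+2000..U+200A are space separators (general category Zs).
space_range = [chr(cp) for cp in range(0x2000, 0x200B)]
categories = {unicodedata.category(ch) for ch in space_range}

# U+200B ZERO WIDTH SPACE is a format character (Cf) in current Unicode data.
zwsp_category = unicodedata.category("\u200b")

print(categories, zwsp_category)
```
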
To cite Brion Vibber in #14600:
> They're not technically illegal, but perhaps should be excluded as they
> wouldn't be useful.
and in #1524:
> *Invalid* characters (those that are illegal in XML or don't reliably cut and
> paste) need to be outright blocked in titles.
Although U+200B ZERO WIDTH SPACE seems to cut-and-paste on Windows, it's not something I'd call 'reliable' - when selecting characters from left to right, it's easy to miss the U+200B ZERO WIDTH SPACE at the end of the page title. As such, I think it's reasonable not to allow the character, or to replace it with an underscore.
To clarify: the pywikipedia bug was caused by calling .strip() on the page title. With Unicode data older than 4.0 (python < 2.7), this strips the U+200B character; with Unicode 4.0 or later (python >= 2.7), it does *not* strip the U+200B character.
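The newer behaviour is easy to reproduce. The following is a sketch assuming Python 3, which behaves like python >= 2.7 in this respect; the title `"Example"` is made up for illustration and is not the affected page.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

title = "Example" + ZWSP  # hypothetical title with a trailing ZWSP

# ZWSP is not treated as whitespace, so a bare strip() leaves it in place.
stripped = title.strip()
still_has_zwsp = stripped.endswith(ZWSP)

# Passing the character explicitly removes it regardless of the Unicode version.
explicit = title.strip(ZWSP)
```

Code that relies on a bare `.strip()` to clean titles therefore ends up comparing two strings that look identical but differ by one invisible character.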
I don't _think_ it should be legit to have this char at beginning/end of a title, as zero-width space is meant to be used as a separator to disable ligatures etc, and only makes sense within a span of non-whitespace characters.
In the middle of words, it may actually be required for some languages.
Correct behavior is _probably_ to strip this char from beginning/end during title normalization, while preserving it in the middle of words. Once that's done, a run of cleanupTitle & co should pretty transparently correct existing titles.
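As a rough illustration of that rule, here is a Python sketch that trims U+200B (and ordinary whitespace) from both ends of a title while leaving interior occurrences alone. `normalize_title` is a hypothetical helper, not MediaWiki's actual title normalization, and it ignores every other normalization step.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def normalize_title(title: str) -> str:
    """Hypothetical rule: trim whitespace and U+200B from both ends only."""
    previous = None
    # Repeat until stable so mixed runs such as " \u200b " are fully removed.
    while previous != title:
        previous = title
        title = title.strip().strip(ZWSP)
    return title
```

For example, `normalize_title("\u200bFoo\u200bBar\u200b")` keeps the separator between `Foo` and `Bar` but drops the leading and trailing ones.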
I would recommend double-checking the definitions and actual usage to make sure these assumptions are correct before changing rules, however.
A user reported the existence of these two pages on Portuguese Wikipedia:
which appear on lists such as
as if they were two identically named pages.
Fortunately, this seems to be the only title where this character was used (at least on ptwiki):