Last modified: 2014-10-30 19:44:26 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T29446, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27446 - U+200B ZERO WIDTH SPACE allowed in page titles
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: unicode 42807
Reported: 2011-02-16 10:19 UTC by Merlijn van Deen (test)
Modified: 2014-10-30 19:44 UTC (History)
CC List: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Merlijn van Deen (test) 2011-02-16 10:19:04 UTC
Related bugs: #3969 Unicode tracking; #14600, #5732; #2593, #1524 (regarding usernames)

A bug in the pywikipedia framework [1] showed up when editing interwiki links. This caused bot wars [2], which were resolved by removing the U+200B ZERO WIDTH SPACE from the end of the page title where the problems occurred [3].

Should this character be allowed in page titles? And, more specifically, at the end of a page title?

Characters from the range U+2000-U+200A are already treated as spaces (and replaced by underscores). Since Unicode 4.0, U+200B is no longer considered whitespace by the Unicode Consortium.
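
The category difference can be checked against the Unicode data bundled with Python. This is only a sketch: `category_of` is a helper name made up for illustration, and the result depends on the Unicode version shipped with the interpreter (a current Python 3 is assumed).

```python
import unicodedata


def category_of(code_point: int) -> str:
    """Return the Unicode general category of a code point (hypothetical helper)."""
    return unicodedata.category(chr(code_point))


# U+2000..U+200A are space separators ("Zs"); U+200B is a format character ("Cf").
for cp in (0x2000, 0x200A, 0x200B):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {category_of(cp)}")
```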

To cite Brion Vibber in #14600:
> They're not technically illegal, but perhaps should be excluded as they
> wouldn't be useful.

and in #1524:
> *Invalid* characters (those that are illegal in XML or don't reliably cut and
> paste) need to be outright blocked in titles.

Although U+200B ZERO WIDTH SPACE seems to survive cut-and-paste on Windows, I would not call it 'reliable': when selecting characters from left to right, it is easy to miss the U+200B ZERO WIDTH SPACE at the end of the page title. As such, I think it is reasonable to disallow the character, or to replace it with an underscore.

Comment 1 Merlijn van Deen (test) 2011-02-16 10:21:07 UTC
To clarify: the pywikipedia bug was caused by calling .strip() on the page title. With Unicode data older than 4.0 (Python < 2.7), this strips the U+200B character; with Unicode data newer than 4.0 (Python >= 2.7), it does *not* strip it.
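
A minimal sketch of the behaviour on a current Python 3 interpreter, where U+200B is not treated as whitespace. The title "Example" is a made-up placeholder, not one of the affected pages, and this is not the actual pywikipedia code.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Hypothetical page title ending in a zero width space.
title = "Example" + ZWSP

# str.strip() with no arguments only removes characters for which
# str.isspace() is true, so the trailing U+200B is kept.
stripped = title.strip()

# U+2009 THIN SPACE, by contrast, is whitespace and is removed.
thin_space_stripped = ("Example" + "\u2009").strip()

# Removing U+200B requires naming it explicitly.
explicit = title.strip().strip(ZWSP)

print(repr(stripped), repr(thin_space_stripped), repr(explicit))
```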
Comment 2 Brion Vibber 2011-02-22 20:34:32 UTC
I don't _think_ it should be legit to have this char at beginning/end of a title, as zero-width space is meant to be used as a separator to disable ligatures etc, and only makes sense within a span of non-whitespace characters.

In the middle of words, it may actually be required for some languages.

Correct behavior is _probably_ to strip this char from beginning/end during title normalization, while preserving it in the middle of words. Once that's done, a run of cleanupTitle & co should pretty transparently correct existing titles.
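
As a rough illustration of that rule (not MediaWiki's actual title normalization, which is written in PHP), a sketch in Python could look like the following. `normalize_title` is a hypothetical name, and the sketch only deals with whitespace and U+200B at the ends of the title.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def normalize_title(title: str) -> str:
    """Trim whitespace and U+200B from both ends; keep U+200B inside the title.

    Hypothetical sketch only. The loop repeats until nothing changes, so
    mixed runs such as " \u200b " at either end are removed completely.
    """
    previous = None
    while previous != title:
        previous = title
        title = title.strip().strip(ZWSP)
    return title


# A zero width space in the middle of the title is preserved.
print(repr(normalize_title(ZWSP + "Foo" + ZWSP + "bar" + ZWSP)))
```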

I would recommend double-checking the definitions and actual usage to make sure these assumptions are correct before changing rules, however.
Comment 3 Helder 2014-02-28 11:55:40 UTC
A user reported the existence of these two pages on Portuguese Wikipedia:
which appear on lists such as
as if they were two identically named pages.

Fortunately, this seems to be the only title where this character was used (at least on ptwiki):
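
For reference, a check like this can be sketched in a few lines of Python. The titles below are invented placeholders rather than the ptwiki pages, and `titles_with_zwsp` and `show_hidden` are hypothetical helper names.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def titles_with_zwsp(titles):
    """Return the titles that contain U+200B anywhere (hypothetical helper)."""
    return [t for t in titles if ZWSP in t]


def show_hidden(title):
    """Make U+200B visible so that look-alike titles can be told apart."""
    return title.replace(ZWSP, "<U+200B>")


# Invented sample data: the first two titles render identically.
titles = ["Example", "Example" + ZWSP, "Other page"]

for t in titles_with_zwsp(titles):
    print(show_hidden(t))
```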
