Last modified: 2014-10-30 19:44:26 UTC

Bug 27446 - U+200B ZERO WIDTH SPACE allowed in page titles
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: unicode 42807
Reported: 2011-02-16 10:19 UTC by Merlijn van Deen (test)
Modified: 2014-10-30 19:44 UTC

Description Merlijn van Deen (test) 2011-02-16 10:19:04 UTC
Related bugs: #3969 Unicode tracking; #14600, #5732; #2593, #1524 (regarding usernames)

A bug in the pywikipedia framework [1] showed up when editing interwikis. This caused bot wars [2], which have been fixed by removing the U+200B ZERO WIDTH SPACE from the end of the page title where the problems happened [3].

Should this character be allowed in page titles? And, more specifically, at the end of a page title?

Characters from the range U+2000-U+200A are already treated as spaces (and replaced by underscores). Since Unicode 4.0, U+200B is no longer considered whitespace by the Unicode Consortium.
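
The category difference can be checked against Python's bundled Unicode database. This is only a sketch: the helper name space_like is made up for illustration, and it says nothing about how MediaWiki itself classifies these characters.

```python
import unicodedata


def space_like(ch):
    """Return True if the character is in Unicode category Zs (space separator)."""
    return unicodedata.category(ch) == "Zs"


# Print the general category of U+2000 through U+200B.
for cp in range(0x2000, 0x200C):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```

On a current Python 3, U+2000-U+200A report as Zs, while U+200B reports as Cf (a format character).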

To cite Brion Vibber in #14600:
> They're not technically illegal, but perhaps should be excluded as they
> wouldn't be useful.

and in #1524:
> *Invalid* characters (those that are illegal in XML or don't reliably cut and
> paste) need to be outright blocked in titles.

Although U+200B ZERO WIDTH SPACE seems to cut-and-paste on Windows, it's not something I'd call 'reliable': when selecting characters from left to right, it's easy to miss the U+200B ZERO WIDTH SPACE at the end of the page title. As such, I think it's reasonable not to allow the character, or to replace it with an underscore.

[1] https://sourceforge.net/tracker/?func=detail&atid=603138&aid=3182761&group_id=93107
[2] http://en.wikipedia.org/w/index.php?title=Podolsk&action=history
[3] http://bo.wikipedia.org/w/index.php?title=%E0%BD%94%E0%BD%BC%E0%BC%8B%E0%BD%91%E0%BD%BC%E0%BD%A3%E0%BC%8B%E0%BD%A6%E0%BD%B2%E0%BD%82&action=history
Comment 1 Merlijn van Deen (test) 2011-02-16 10:21:07 UTC
To clarify: the pywikipedia bug was caused by calling .strip() on the page title. With a Unicode database older than 4.0 (Python < 2.7), this strips the U+200B character; with Unicode >= 4.0 (Python >= 2.7), it does *not*.
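
A minimal sketch of the newer behavior, assuming a current Python 3 interpreter. The title used here is only an illustration, not the page that triggered the bug.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Illustrative title with a trailing zero-width space.
title = "Podolsk" + ZWSP

# strip() with no argument removes whitespace only; U+200B is not whitespace here.
default_stripped = title.strip()

# Passing the character explicitly removes it regardless of its Unicode category.
explicit_stripped = title.strip(ZWSP)

print(repr(default_stripped))
print(repr(explicit_stripped))
```

A title compared before and after a bare strip() therefore still carries the invisible trailing character, which is the kind of mismatch that can keep two bots reverting each other.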
Comment 2 Brion Vibber 2011-02-22 20:34:32 UTC
I don't _think_ it should be legit to have this char at beginning/end of a title, as zero-width space is meant to be used as a separator to disable ligatures etc, and only makes sense within a span of non-whitespace characters.

In the middle of words, it may actually be required for some languages.

Correct behavior is _probably_ to strip this char from beginning/end during title normalization, while preserving it in the middle of words. Once that's done, a run of cleanupTitle & co should pretty transparently correct existing titles.
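
The rule proposed above could look roughly like this. It is a sketch in Python, not MediaWiki's actual title normalization code, and the name normalize_title is hypothetical.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def normalize_title(title: str) -> str:
    """Strip whitespace and U+200B from both ends; leave the interior untouched."""
    previous = None
    # Repeat until stable so that mixed runs such as " \u200b " are fully removed.
    while previous != title:
        previous = title
        title = title.strip().strip(ZWSP)
    return title


print(repr(normalize_title("\u200b foo\u200bbar \u200b")))
```

The loop is there because whitespace and U+200B can alternate at the edges; a single pass of each strip would leave some behind.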

I would recommend double-checking the definitions and actual usage to make sure these assumptions are correct before changing rules, however.
Comment 3 Helder 2014-02-28 11:55:40 UTC
A user reported the existence of these two pages on Portuguese Wikipedia:
https://pt.wikipedia.org/wiki/Coming_Out_of_the_Dark
https://pt.wikipedia.org/wiki/Coming_Out_%E2%80%8B%E2%80%8Bof_the_Dark
which appear on lists such as
https://pt.wikipedia.org/wiki/Special:PrefixIndex/Coming_Out?uselang=en
as if they were two identically named pages.

Fortunately, this seems to be the only title where this character was used (at least on ptwiki):
http://tools.wmflabs.org/addshore/grep/?pattern=%E2%80%8B%E2%80%8B&lang=pt&wiki=wiki&ns=0
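
A rough way to look for such look-alike titles in a list of page titles is to group them by their form with U+200B removed. This is a sketch only: find_lookalikes is a hypothetical helper, and the input list would have to come from a dump or an API query.

```python
from collections import defaultdict

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def find_lookalikes(titles):
    """Group titles that become identical once U+200B is removed."""
    groups = defaultdict(list)
    for title in titles:
        groups[title.replace(ZWSP, "")].append(title)
    # Keep only groups where more than one distinct title collapses together.
    return {key: found for key, found in groups.items() if len(found) > 1}


titles = [
    "Coming_Out_of_the_Dark",
    "Coming_Out_\u200b\u200bof_the_Dark",
    "Coming_Out",
]
lookalikes = find_lookalikes(titles)
for key, found in lookalikes.items():
    print(key, [ascii(t) for t in found])
```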
