Last modified: 2010-04-06 19:37:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T3414, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 1414 - Unicode whitespaces allowed in article title
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal with 5 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 1971 12080
Depends on:
Blocks: unicode
Reported: 2005-01-27 14:43 UTC by FoeNyx
Modified: 2010-04-06 19:37 UTC

Description FoeNyx 2005-01-27 14:43:52 UTC
<FoeNyx> the « » (U+2003 em space) should be an invalid article name, no?
<zwitter> all whitespace other than U+0020 should be
Comment 1 Brion Vibber 2005-04-25 04:50:05 UTC
comment from bug 1971:
> Moving a page to a title like [[« Pour l'Ukraine unie ! »]] creates a page with
> non-breaking spaces in the title; the page move was done here:
> http://fr.wikipedia.org/w/index.php?title=Pour_une_Ukraine_unie_%21&action=history,
> resulting page is
> http://fr.wikipedia.org/wiki/%C2%AB%C2%A0Pour_l%27Ukraine_unie%C2%A0%21%C2%A0%C2%BB

Regarding the non-breaking space (U+00A0) specifically, it's generally transformed silently into U+0020 spaces when it goes 
through the <textarea>->submit edit cycle and is not preserved, making it extra annoying.
Comment 2 Brion Vibber 2005-04-25 04:50:24 UTC
*** Bug 1971 has been marked as a duplicate of this bug. ***
Comment 3 Brion Vibber 2005-04-25 06:10:15 UTC
I've just done a little research on Unicode whitespace handling; the Zs, Zl, and Zp character classes seem to be relevant, and the set of them or some variant is what's counted by e.g. Java's Character.isSpaceChar() and .NET's Char.IsSeparator().

It might make sense to explicitly disallow the Zl and Zp chars (line separator and paragraph separator), and normalize all the Zs 
chars to spaces (well, underscores) in title processing.

A quick grep of the current UnicodeData.txt database lists:

0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;

2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
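
The proposal in this comment (reject Zl and Zp, fold every Zs character into an underscore) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the change that was eventually committed to MediaWiki: the names `InvalidTitleError` and `normalize_title_whitespace` are hypothetical, and Python's `unicodedata` module follows a newer Unicode version than the 2005 listing above, in which, for instance, U+180E is no longer classified as Zs.

```python
import unicodedata


class InvalidTitleError(ValueError):
    """Hypothetical error for titles containing a Zl or Zp character."""


def normalize_title_whitespace(title: str) -> str:
    """Map every Zs character to '_' and reject Zl/Zp characters."""
    out = []
    for ch in title:
        category = unicodedata.category(ch)
        if category in ("Zl", "Zp"):
            # Line and paragraph separators are disallowed outright.
            raise InvalidTitleError(f"U+{ord(ch):04X} is not allowed in a title")
        # Any space separator (U+0020, U+00A0, U+2003, ...) becomes '_'.
        out.append("_" if category == "Zs" else ch)
    return "".join(out)
```

With this sketch, `normalize_title_whitespace("em\u2003space")` and `normalize_title_whitespace("em space")` both yield `em_space`, while a title containing U+2028 raises the error.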
Comment 4 FoeNyx 2005-05-01 16:52:10 UTC
*** Bug 2042 has been marked as a duplicate of this bug. ***
Comment 5 tsor 2005-05-01 19:44:38 UTC
There is another problem with UTF-8 titles. The representation of a character in
a foreign codepage looks like a normal character in our codepage.

You may find examples in
http://de.wikipedia.org/w/index.php?title=Spezial:Log&type=delete&user=&page=&limit=500&offset=50
Look for entries in 1-may-2005 3:45 - 3:55 h ("K.D.St.V. CarοIus Маgnus").
Please view this text in html code. Examples:

<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_%D0%9Ca%C9%A1nu%D1%95&amp;action=edit"

<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_Magnus&amp;action=edit"
<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolus_%CE%9Cagnu%D1%95&amp;action=edit"

 
tsor  (administrator of German WP)
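
The look-alike characters in the links above become visible once the percent-encoded title is decoded and each non-ASCII character is printed with its Unicode name. The following is a minimal sketch using only the Python standard library; the function name `non_ascii_characters` is hypothetical and is not part of MediaWiki.

```python
import unicodedata
from urllib.parse import unquote


def non_ascii_characters(encoded_title: str):
    """Decode a percent-encoded title and describe each non-ASCII character."""
    title = unquote(encoded_title)  # percent-decoding assumes UTF-8 by default
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in title
        if ord(ch) > 0x7F
    ]


first_link = "%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_%D0%9Ca%C9%A1nu%D1%95"
for code_point, name in non_ascii_characters(first_link):
    print(code_point, name)
```

For the first link, the leading "K" decodes to U+039A GREEK CAPITAL LETTER KAPPA and the "S" in "St." to U+0405 CYRILLIC CAPITAL LETTER DZE, so the title mixes Greek, Cyrillic, and Latin letters that render almost identically.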

Comment 6 FoeNyx 2005-05-02 11:01:15 UTC
(In reply to comment #5)
> There is another problem with UTF-8 titles. The representation of a character in
> a foreign codepage looks like a normal character in our codepage.

I reopened bug 2042, as it's not exactly the same.
(This bug is a subset of bug 2042 that covers only homograph pairs of whitespace characters.)
Comment 7 Rick Block 2006-07-11 19:13:58 UTC
Curly vs. straight quotes have been causing confusion at en lately as well.
Comment 8 FoeNyx 2006-09-17 13:38:18 UTC
(In reply to comment #7)
> Curly vs. straight quotes have been causing confusion at en lately as well.

This bug is for whitespace characters; the quote confusion is probably better
suited to bug 2042.
Comment 9 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-07-08 02:23:21 UTC
Before anything is done on this, obviously a check needs to be run on the various wikis to see if they use these.  It seems probable that IDEOGRAPHIC SPACE, for instance, should not be blacklisted.  In general, there are various reasons to use various types of spaces, and I think it would be best if these were normalized for storage but not blacklisted, so you can't have two article names that differ only in the type or number of spaces used but you can still have unusual spaces in character titles.  This should be part of the eventual move to case-insensitivity for titles (bug 453).
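
The "normalize for storage, but do not blacklist" idea in this comment could look roughly like the sketch below: the title is kept as typed for display, while a separate key decides whether two titles collide. The function name `title_storage_key` is hypothetical, and treating underscores like spaces and trimming them at the ends are assumptions added for the illustration.

```python
import re
import unicodedata


def title_storage_key(title: str) -> str:
    """Build a comparison key that ignores the type and number of spaces."""
    # Fold every Zs character to a plain space first.
    spaced = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in title
    )
    # Collapse runs of spaces/underscores into one underscore, trim the ends.
    return re.sub(r"[ _]+", "_", spaced).strip("_")
```

Under these assumptions, "Foo\u3000Bar" (IDEOGRAPHIC SPACE) and "Foo  Bar" (two plain spaces) share the key `Foo_Bar`, so only one of them could exist as an article, yet neither title is rejected for the space it uses.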
Comment 10 Brion Vibber 2007-12-06 18:14:27 UTC
*** Bug 12080 has been marked as a duplicate of this bug. ***
Comment 11 Ilmari Karonen 2010-04-06 19:37:42 UTC
I think this was fixed by r55382 (and the follow-ups to it) back in 2009.  Closing.
