Last modified: 2010-04-06 19:37:42 UTC
<FoeNyx> the « » (U+2003 em space) should be an invalid article name, no?
<zwitter> all whitespace other than U+0020 should be
comment from bug 1971:
> Moving a page to a title like [[« Pour l'Ukraine unie ! »]] create page with non
> breakable space in the title, page move has been done here :
> http://fr.wikipedia.org/w/index.php?title=Pour_une_Ukraine_unie_%21&action=history,
> resulting page is
> http://fr.wikipedia.org/wiki/%C2%AB%C2%A0Pour_l%27Ukraine_unie%C2%A0%21%C2%A0%C2%BB

Regarding the non-breaking space (U+00A0) specifically, it's generally transformed silently into U+0020 spaces when it goes through the <textarea>->submit edit cycle and is not preserved, making it extra annoying.
*** Bug 1971 has been marked as a duplicate of this bug. ***
I've just done a little research on Unicode whitespace handling; the Zs, Zl, and Zp character classes seem to be relevant, and the set of them or some variant is what's counted by e.g. Java's Character.isSpaceChar() and .NET's Char.IsWhiteSpace().

It might make sense to explicitly disallow the Zl and Zp chars (line separator and paragraph separator), and normalize all the Zs chars to spaces (well, underscores) in title processing.

A quick grep of the current UnicodeData.txt database lists:

0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
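For illustration only, here is a rough Python sketch of the rule proposed in this comment: reject Zl and Zp characters, and map every Zs character to an underscore. It is not MediaWiki's actual title code; the names normalize_title_whitespace and InvalidTitleError are made up for this example. Note that Python's unicodedata module reflects whichever Unicode version ships with the interpreter, so its classification may differ from an older UnicodeData.txt (U+180E, for example, was later reclassified out of Zs).

```python
import unicodedata


class InvalidTitleError(ValueError):
    """Hypothetical error for titles containing a line/paragraph separator."""


def normalize_title_whitespace(title: str) -> str:
    """Map every Zs character to '_' and reject Zl/Zp characters."""
    out = []
    for ch in title:
        category = unicodedata.category(ch)
        if category in ("Zl", "Zp"):
            # Line separator (U+2028) and paragraph separator (U+2029).
            raise InvalidTitleError(f"disallowed separator U+{ord(ch):04X}")
        # Any space separator (U+0020, U+00A0, U+2003, U+3000, ...) becomes '_'.
        out.append("_" if category == "Zs" else ch)
    return "".join(out)


if __name__ == "__main__":
    print(normalize_title_whitespace("Pour\u00a0l'Ukraine\u2003unie"))
```

This sketch only replaces characters one for one; it does not collapse runs of spaces or trim leading and trailing ones.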
*** Bug 2042 has been marked as a duplicate of this bug. ***
There is another problem with UTF-8 titles. The representation of a character in a foreign codepage looks like a normal character in our codepage. You may find examples in http://de.wikipedia.org/w/index.php?title=Spezial:Log&type=delete&user=&page=&limit=500&offset=50

Look for entries in 1-may-2005 3:45 - 3:55 h ("K.D.St.V. CarοIus Маgnus"). Please view this text in the HTML source. Examples:

<a href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_%D0%9Ca%C9%A1nu%D1%95&action=edit"
<a href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_Magnus&action=edit"
<a href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolus_%CE%9Cagnu%D1%95&action=edit"

tsor (administrator of German WP)
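To make the lookalikes in these links visible without reading the raw HTML, a small Python sketch can decode the percent-encoded title and name each non-ASCII character. The function name describe_non_ascii is invented for this example; the encoded title is the first one quoted in this comment.

```python
import unicodedata
from urllib.parse import unquote


def describe_non_ascii(encoded_title: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each non-ASCII character."""
    title = unquote(encoded_title)  # percent-decoding, UTF-8 by default
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in title
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    encoded = "%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_%D0%9Ca%C9%A1nu%D1%95"
    for code_point, name in describe_non_ascii(encoded):
        print(code_point, name)
```

For that title the listing starts with a Greek capital kappa standing in for the Latin "K", followed by Cyrillic letters standing in for "S", "C", "s" and "M".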
(In reply to comment #5)
> There is another problem with UTF8 titles. The representation of a character in
> a foreign codepage looks like a normal character in our codepage.

I reopened bug 2042, as it's not exactly the same. (This bug is a subset of bug 2042, covering only homograph pairs of whitespace characters.)
Curly vs. straight quotes have been causing confusion at en lately as well.
(In reply to comment #7)
> Curly vs. straight quotes have been causing confusion at en lately as well.

This bug is for whitespace characters; the quote confusion is probably better suited to bug 2042.
Before anything is done on this, obviously a check needs to be run on the various wikis to see if they use these. It seems probable that IDEOGRAPHIC SPACE, for instance, should not be blacklisted. In general, there are various reasons to use various types of spaces, and I think it would be best if these were normalized for storage but not blacklisted, so you can't have two article names that differ only in the type or number of spaces used but you can still have unusual spaces in character titles. This should be part of the eventual move to case-insensitivity for titles (bug 453).
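The "normalize for storage, don't blacklist" idea, and the check that would have to be run on existing wikis first, could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the names title_key and find_collisions are invented, and treating underscores as spaces, collapsing runs, and trimming the ends are choices of this sketch, not a description of what MediaWiki does.

```python
import unicodedata
from collections import defaultdict


def title_key(title: str) -> str:
    """Storage key: any Zs char or '_' counts as one space; runs collapse."""
    mapped = "".join(
        " " if ch == "_" or unicodedata.category(ch) == "Zs" else ch
        for ch in title
    )
    # Splitting on U+0020 and dropping empty parts collapses runs of
    # spaces and trims leading/trailing ones.
    return "_".join(part for part in mapped.split(" ") if part)


def find_collisions(titles: list[str]) -> dict[str, list[str]]:
    """Group existing titles that would share one key after normalization."""
    groups = defaultdict(list)
    for title in titles:
        groups[title_key(title)].append(title)
    return {key: group for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    existing = ["Foo Bar", "Foo\u00a0Bar", "Foo\u3000\u3000Bar", "Baz"]
    for key, group in find_collisions(existing).items():
        print(key, [t.encode("unicode_escape").decode() for t in group])
```

Running find_collisions over a dump of existing titles would show which pages differ only in the type or number of spaces before any normalization is switched on.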
*** Bug 12080 has been marked as a duplicate of this bug. ***
I think this was fixed by r55382 (and the follow-ups to it) back in 2009. Closing.