Last modified: 2014-04-29 17:19:02 UTC
MediaWiki allows characters in the U+0080 to U+009F range in article titles and bodies. These characters should never appear in valid HTML. I suggest they be handled like characters in the U+0000 to U+001F range: using them in article titles/URLs should lead to a "Bad title" error, and when used in the article body they should be replaced with U+FFFD upon save.
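For illustration, a minimal sketch of the proposed behaviour (Python with hypothetical names; MediaWiki itself is PHP, so this is not actual MediaWiki code):

    import re

    # C1 controls: U+0080 to U+009F.
    C1 = re.compile(r'[\u0080-\u009F]')

    def check_title(title):
        # Proposed: reject titles containing C1 controls outright.
        if C1.search(title):
            raise ValueError('Bad title')
        return title

    def sanitize_body(text):
        # Proposed: replace C1 controls with U+FFFD on save.
        return C1.sub('\uFFFD', text)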
Should be added to the title character blacklist. While technically legal, these are control characters. (Unlike the ASCII control characters, they are allowed in XML, however.)
(In reply to comment #1)
> Should be added to the title character blacklist. While technically legal,
> these are control characters. (Unlike the ASCII control characters, they are
> allowed in XML, however.)

No, they are not. If any of these characters is present, the page will fail to validate, unlike a page which contains a CR, for example.
This may be an error in your validation tool. XML 1.0 explicitly includes them among the allowed character ranges, see: http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
(In reply to comment #3)
> This may be an error in your validation tool.
>
> XML 1.0 explicitly includes them among the allowed character ranges, see:
> http://www.w3.org/TR/2004/REC-xml-20040204/#charsets

Interesting. I was using the W3C's validator: http://validator.w3.org/check?uri=http%3A%2F%2Fen.wikipedia.org%2Fw%2Findex.php%3Ftitle%3D%25C2%2580.
Yeah; the description of the error contains some definite mistakes:

"HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets."

1) Unicode most definitely *does* define these; you can see them in the code charts and the character database:
http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/Public/UNIDATA/

2) The list of points mentioned there as undefined includes tab, newline, and carriage return, which are *most definitely* both defined in Unicode and allowed in HTML.

3) So far as I'm aware, neither HTML nor XHTML declares these characters to be disallowed, and XML only disallows a subset of 0-31 (minus tab, newline, and carriage return).
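(Point 1 is easy to verify programmatically; for example, in Python the Unicode character database reports every code point in that range as a defined control character:)

    import unicodedata

    # Every code point in U+0080..U+009F has general category Cc
    # ("Other, control"), i.e. Unicode defines them as control characters.
    for cp in range(0x80, 0xA0):
        assert unicodedata.category(chr(cp)) == 'Cc'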
Compare http://test.wikipedia.org/wiki/User:R._Koot/C1-1 (which uses numeric entities) with http://test.wikipedia.org/wiki/User:R._Koot/C1-2 (which simply uses the characters directly). C1-1 validates, while C1-2 doesn't. They also look different (under Firefox 1.0.7/SUSE 10.0): C1-1 displays characters from the Windows-1252 character set in a TrueType font, while C1-2 displays them in a bitmapped font and also shows glyphs where C1-1 has blanks. I viewed C1-2 earlier today on Firefox 1.5/Windows 2000; all the characters are blank except for one, which displays as a question mark. There is definitely some compatibility stuff going on here. Could it be that XML only allows characters in the U+0080-U+009F range to be represented using numeric entities?
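The rendering difference would be consistent with browsers applying the Windows-1252 "fixup" to NCRs in the 128-159 range but not (always) to raw code points. A small Python illustration of the two interpretations of byte 0x93 (just a sketch of the mapping, not of what any particular browser does):

    # Byte 0x93 is a left curly quote in Windows-1252, but code point
    # U+0093 is the C1 control SET TRANSMIT STATE.
    print(repr(bytes([0x93]).decode('cp1252')))   # '\u201c' -> a curly quote
    print(repr(bytes([0x93]).decode('latin-1')))  # '\x93', a C1 control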
No, XML allows them completely. Anyway that's not really relevant, we probably want to ban them just to avoid confusion. :)
re #7: I'm not sure about U+00 up to U+1F; IIRC the allowed characters are HT, LF, and CR. Anything else, including FF and VT, is bad. The range U+7F (not U+80, it starts at 127) up to U+9F used to be bad. In XML 1.1 (caveat: XHTML 1.0 is XML 1.0) U+85 NEL was declared to be okay, because the EBCDIC folks have a single NEL elsewhere in addition to their own variants of CR and LF. But that's XML 1.1; at the moment U+7F up to U+9F is marked as invalid by the W3C validator (for Unicode charsets, or independent of that for NCRs &#x80; up to &#x9F;).
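For reference, the Char production from XML 1.0 section 2.2, expressed as a check (a sketch; note that it permits U+0080..U+009F, which is exactly the point of disagreement with the validator):

    def is_xml10_char(cp):
        # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
        #        | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        return (cp in (0x09, 0x0A, 0x0D)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    print(is_xml10_char(0x85))  # True: XML 1.0 allows NEL
    print(is_xml10_char(0x0C))  # False: form feed is excluded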
Can't the software assume that a browser sending characters in the 7F ... 9F range is sending Windows CP-1252 typographic characters? In this case, shouldn't they just be converted to the Unicode equivalents and entered thus into the database, once and for all? I can't imagine that there is any utility in entering the equivalent Unicode values into wikitext; aren't they all control characters which have no valid display, or are only useful in a text terminal?
I believe that most modern web browsers display these characters assuming that they are Windows CP-1252 anyway, so why not explicitly enter into the database what is being assumed?
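A minimal sketch of the conversion proposed in the last two comments (the function name is hypothetical; note that five of the 32 positions are unassigned even in CP-1252 and would still need a fallback):

    def c1_to_cp1252(text):
        # Reinterpret C1 code points as Windows-1252 bytes and map them
        # to the typographic characters the author presumably intended.
        def fix(ch):
            if '\x80' <= ch <= '\x9f':
                try:
                    return bytes([ord(ch)]).decode('cp1252')
                except UnicodeDecodeError:
                    # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unmapped.
                    return '\uFFFD'
            return ch
        return ''.join(fix(ch) for ch in text)

    print(c1_to_cp1252('\x93quoted\x94'))  # -> '\u201cquoted\u201d'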
Someone knowledgeable at the W3C has concluded that the SGML, XML 1.0, XML 1.1, HTML 4.01 and XHTML 1.0 specifications are inconsistent and unclear on this point, but suggests that the correct behaviour of the W3C validator is to reject these characters as invalid <http://www.w3.org/People/cmsmcq/2007/C1.xml>.
Five years later, MediaWiki still allows these C1 control codes (http://en.wikipedia.org/wiki/C0_and_C1_control_codes), even though they can cause problems, especially when appearing in filenames (the file can only be used by copy-pasting the name from the file page, and it's a mystery to the user why). As far as I can tell these codes are not valid characters in XML (http://en.wikipedia.org/wiki/Valid_characters_in_XML), with the possible exception of U+0085, which should perhaps be translated to a newline (I think). Can we do something about this?
This is pretty bad.