Last modified: 2014-04-29 17:19:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T7732, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 5732 - MediaWiki allows characters in the U+0080 to U+009F range
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: 1.24rc
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://en.wikipedia.org/wiki/%C2%80
Reported: 2006-04-27 04:43 UTC by Ruud Koot
Modified: 2014-04-29 17:19 UTC (History)

Description Ruud Koot 2006-04-27 04:43:50 UTC
MediaWiki allows characters in the U+0080 to U+009F range in article titles and
bodies. These characters should never appear in valid HTML. I suggest they be
handled like characters in the U+0000 to U+001F range. Using them in article
titles/URLs should lead to a "Bad title" error. When used in the article
body they should be replaced with U+FFFD upon save.
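
A minimal sketch of the proposed behaviour, in Python rather than MediaWiki's own PHP. The names is_bad_title and sanitize_body are hypothetical and do not correspond to any MediaWiki function; only the C1 range is handled here.

```python
import re

# Matches any character in the C1 control range U+0080..U+009F.
_C1_RANGE = re.compile(r"[\u0080-\u009f]")

def is_bad_title(title: str) -> bool:
    """Return True if the title contains a C1 control character (hypothetical check)."""
    return _C1_RANGE.search(title) is not None

def sanitize_body(text: str) -> str:
    """Replace each C1 control character with U+FFFD, as proposed for saves."""
    return _C1_RANGE.sub("\ufffd", text)
```

With this sketch a title consisting of U+0080 would be rejected, while a body containing U+0085 would be saved with U+FFFD in its place.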
Comment 1 Brion Vibber 2006-04-27 17:28:38 UTC
Should be added to title chars blacklist. While technically legal, these are technically control characters. (Unlike 
the ASCII control characters they are allowed in XML however.)
Comment 2 Ruud Koot 2006-04-27 20:59:08 UTC
(In reply to comment #1)
> Should be added to title chars blacklist. While technically legal, these are
> technically control characters. (Unlike
> the ASCII control characters they are allowed in XML however.)

No, they are not. If any of these characters is present the page will fail to
validate, unlike a page which contains a CR, for example.
Comment 3 Brion Vibber 2006-04-27 22:38:01 UTC
This may be an error in your validation tool.

XML 1.0 explicitly includes them among the allowed character ranges, see:
http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
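
For reference, the Char production in that section can be written out as a small Python check. This is a sketch of the XML 1.0 ranges only; the function name is_xml10_char is made up for illustration.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Every code point in U+0080..U+009F falls inside [#x20-#xD7FF].
c1_allowed = all(is_xml10_char(chr(cp)) for cp in range(0x80, 0xA0))
```

The C0 controls other than tab, newline and carriage return fall outside every range, while the whole C1 block is covered by the second one.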
Comment 4 Ruud Koot 2006-04-27 23:02:23 UTC
(In reply to comment #3)
> This may be an error in your validation tool.
> 
> XML 1.0 explicitly includes them among the allowed character ranges, see:
> http://www.w3.org/TR/2004/REC-xml-20040204/#charsets


Interesting. I was using W3C's validator:
http://validator.w3.org/check?uri=http%3A%2F%2Fen.wikipedia.org%2Fw%2Findex.php%3Ftitle%3D%25C2%2580.
Comment 5 Brion Vibber 2006-04-27 23:08:02 UTC
Yeah; the description of the error contains some definite mistakes:

"HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 
65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical 
quote marks and similar in proprietary character sets."

1) Unicode most definitely *does* define these, you can see them in the code charts and the character 
database:
http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/Public/UNIDATA/

2) The list of points mentioned there as undefined includes tab, newline, and carriage return, which are 
*most definitely* both defined in Unicode and allowed in HTML.

3) So far as I'm aware, neither HTML nor XHTML declares these characters to be disallowed, and XML only 
disallows a subset of 0-31 (minus tab, newline, and carriage return).
Comment 6 Ruud Koot 2006-04-27 23:37:38 UTC
Compare http://test.wikipedia.org/wiki/User:R._Koot/C1-1 (which uses numeric
entities) with http://test.wikipedia.org/wiki/User:R._Koot/C1-2 (which simply
uses the characters directly). C1-1 validates, while C1-2 doesn't. They also
look different (under Firefox 1.0.7/SUSE 10.0). C1-1 displays characters from
the Windows-1252 character set in a truetype font, while C1-2 displays them in a
bitmapped font and also shows glyphs where there are blanks at C1-1. I viewed
C1-2 earlier today on Firefox 1.5/Windows 2000. All the characters are black
except for one, which displays as a question mark. There is definitely some
compatibility stuff going on here. Could it be that XML only allows
characters in the U+0080-U+009F range to be represented using numeric entities?
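
The difference between the two pages can be illustrated with Python's html.unescape, which follows the HTML5 rules for numeric character references (those rules postdate this report and map references in the 128-159 range to Windows-1252 characters). This only illustrates the entity-versus-raw distinction; it is not a claim about what Firefox did internally at the time.

```python
import html

# Numeric entities in the C1 range are remapped to Windows-1252 characters.
entity_form = html.unescape("&#147;quoted&#148;")

# The same code points written directly stay as C1 control characters.
raw_form = "\u0093quoted\u0094"
raw_after_unescape = html.unescape(raw_form)
```

Here entity_form ends up with curly quotes (U+201C and U+201D), while raw_form is left unchanged.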
Comment 7 Brion Vibber 2006-04-28 00:13:46 UTC
No, XML allows them completely.

Anyway that's not really relevant, we probably want to ban them just to avoid confusion. :)
Comment 8 omniplex 2006-05-26 06:13:39 UTC
re #7: I'm not sure about u+00 up to u+1F, IIRC the allowed characters 
are HT, LF, and CR. Anything else including FF and VT is bad.

The range u+7F (not u+80, it starts at 127) up to u+9F used to be bad.
In XML 1.1 (caveat: XHTML 1.0 is XML 1.0) u+85 NEL was declared to be
okay, because the EBCDIC folks have a single NEL elsewhere in addition
to their own variants of CR and LF. 

But that's XML 1.1, at the moment u+7F up to u+9F is marked as invalid
by the W3C validator (for Unicode charsets, or independent of that for
NCRs &#127; up to &#159;).
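
As a sketch of the XML 1.1 side of this: XML 1.1 defines a RestrictedChar production, and characters matching it may appear only as character references, not literally. The function name below is hypothetical; the ranges are the ones from the XML 1.1 Recommendation.

```python
def is_xml11_restricted(ch: str) -> bool:
    """True if ch matches the XML 1.1 RestrictedChar production:
    [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
    """
    cp = ord(ch)
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
```

U+0085 (NEL) is the one code point in the u+7F to u+9F range that is not restricted, which matches the exception described above.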
Comment 9 Michael Zajac 2006-08-01 22:52:36 UTC
Can't the software assume that a browser sending characters in the 7F ... 9F range is sending Windows CP-1252 typographic 
characters?  In this case, shouldn't they just be converted to the Unicode equivalents and entered thus into the database, once 
and for all?  

I can't imagine that there is any utility in entering the equivalent Unicode values into wikitext—aren't they all control 
characters which have no valid display, or are only useful in a text terminal?

Comment 10 Michael Zajac 2006-08-01 22:54:54 UTC
I believe that most modern web browsers display these characters assuming that they are Windows CP-1252 anyway, so why 
not explicitly enter into the database what is being assumed?
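
A rough sketch of that conversion in Python, assuming the input really was meant as Windows CP-1252. The function name c1_to_cp1252 is hypothetical. Python's cp1252 codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the sketch leaves those code points untouched.

```python
def c1_to_cp1252(text: str) -> str:
    """Reinterpret C1 code points U+0080..U+009F as Windows CP-1252 bytes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                # No CP-1252 mapping for this byte; keep the original character.
                pass
        out.append(ch)
    return "".join(out)
```

For example, U+0093 and U+0094 would become the curly quotes U+201C and U+201D, and U+0080 would become the euro sign U+20AC.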
Comment 11 Ruud Koot 2007-03-24 14:34:18 UTC
Someone knowledgeable at the W3C has concluded that the SGML, XML 1.0, XML 1.1,
HTML 4.01 and XHTML 1.0 specifications are inconsistent and unclear on this
point, but suggests that the correct behaviour of the W3C validator is to reject
these characters as invalid <http://www.w3.org/People/cmsmcq/2007/C1.xml>.
Comment 12 Rd232 2012-06-25 12:46:13 UTC
Five years later, MediaWiki still allows these C1 control codes (http://en.wikipedia.org/wiki/C0_and_C1_control_codes), even though they can cause problems especially when appearing in filenames (the file can only be used by copy-pasting the name from the file page, and it's a mystery to the user why). As far as I can tell these codes are not valid characters in XML (http://en.wikipedia.org/wiki/Valid_characters_in_XML), with the possible exception of U+0085 which if possible should be translated to a newline (I think). Can we do something about this?
Comment 13 Dan Garry 2014-04-29 16:15:40 UTC
This is pretty bad.
