Last modified: 2012-12-10 20:52:21 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T16600, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 14600 - Illegal Unicode characters are allowed in pages
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Page editing
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://ru.wikipedia.org/w/index.php?t...
Reported: 2008-06-20 14:39 UTC by Amir E. Aharoni
Modified: 2012-12-10 20:52 UTC (History)

Description Amir E. Aharoni 2008-06-20 14:39:18 UTC
Take a good look at this diff from the Russian Wikipedia: http://ru.wikipedia.org/w/index.php?title=%D0%91%D0%B8%D1%82%D0%B2%D0%B0_%D0%B7%D0%B0_%D0%9A%D0%B0%D0%B2%D0%BA%D0%B0%D0%B7_(1942%E2%80%941943)&diff=9212389&oldid=9112075

The diff fixes four instances of the Unicode character U+FDD3.

I stumbled upon it when I ran a Perl script that analyzed a dump of the Russian Wikipedia. I ran several pattern matches on every page, and on this page the Perl regular expression engine issued this warning: "Unicode character is illegal" (see http://perldoc.perl.org/perldiag.html ). The code chart in which this character appears does say: "These codes are intended for process-internal uses, but are not permitted for interchange." (Search for FDD3 here: http://www.unicode.org/charts/About.html )

My Unicode expertise ends here. I don't know exactly which characters count as illegal. I can guess that characters with the Noncharacter_Code_Point property are illegal, and maybe there are more. I also don't know exactly what damage these characters cause if saved in the MediaWiki database, but I can guess that they may cause interoperability trouble with external tools: browsers, bots, search engines, future versions of the database engine, etc. They may also cause security breaches. So I suppose there is a warning sign here, and most probably it shouldn't be possible to save pages that include such characters.
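
As a rough illustration of the guess above, here is a minimal Python sketch of a check for code points with the Noncharacter_Code_Point property. It assumes the Unicode definition of that property (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes). The function names `is_noncharacter` and `find_noncharacters` are hypothetical, and this is not how MediaWiki or Perl performs the check.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF: a contiguous block of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each plane: U+nFFFE and U+nFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str) -> list:
    """Hypothetical helper: (offset, 'U+XXXX') for each noncharacter in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_noncharacter(ord(ch))
    ]


# Made-up sample text containing U+FDD3, the character from this report.
sample = "Битва\ufdd3 за Кавказ"
print(find_noncharacters(sample))  # prints [(5, 'U+FDD3')]
```

Running something like this over each page of a dump would list the affected offsets. Whether there are further classes of characters that should count as illegal is still an open question.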
Comment 1 Brion Vibber 2008-06-20 22:28:36 UTC
They're not technically illegal, but perhaps should be excluded as they wouldn't be useful.
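
One possible reading of "excluded" is to strip or replace these code points before text is saved. The following Python sketch shows that idea under stated assumptions: `NONCHAR_RE` and `strip_noncharacters` are hypothetical names, the character set is the Unicode noncharacter list, and nothing here reflects actual MediaWiki code (which is written in PHP).

```python
import re

# Character class for U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF in all 17 planes.
_plane_ends = "".join(
    chr(plane * 0x10000 + 0xFFFE) + chr(plane * 0x10000 + 0xFFFF)
    for plane in range(17)
)
NONCHAR_RE = re.compile("[\ufdd0-\ufdef" + _plane_ends + "]")


def strip_noncharacters(text: str, replacement: str = "") -> str:
    """Hypothetical sanitizer: remove or replace every noncharacter in text."""
    return NONCHAR_RE.sub(replacement, text)


print(strip_noncharacters("a\ufdd3b"))            # prints ab
print(strip_noncharacters("a\ufdd3b", "\ufffd"))  # U+FDD3 becomes U+FFFD
```

Rejecting the save with an error instead of silently rewriting the text would be an equally plausible design. The sketch only shows that matching the characters is straightforward.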
