Last modified: 2012-12-10 20:52:21 UTC

Bug 14600 - Illegal Unicode characters are allowed in pages
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Page editing
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://ru.wikipedia.org/w/index.php?t...
Keywords:
Depends on:
Blocks:

Reported: 2008-06-20 14:39 UTC by Amir E. Aharoni
Modified: 2012-12-10 20:52 UTC (History)
CC List: 0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Amir E. Aharoni 2008-06-20 14:39:18 UTC
Take a good look at this diff from the Russian Wikipedia: http://ru.wikipedia.org/w/index.php?title=%D0%91%D0%B8%D1%82%D0%B2%D0%B0_%D0%B7%D0%B0_%D0%9A%D0%B0%D0%B2%D0%BA%D0%B0%D0%B7_(1942%E2%80%941943)&diff=9212389&oldid=9112075

What was fixed was four instances of the Unicode character U+FDD3.

I stumbled upon it when I ran a Perl script that analyzed a dump of the Russian Wikipedia. I ran several pattern matches on every page, and on this page the Perl regular expression engine issued this warning: "Unicode character is illegal" (see http://perldoc.perl.org/perldiag.html ). The code chart in which this character appears indeed says: "These codes are intended for process-internal uses, but are not permitted for interchange." (Search for FDD3 here: http://www.unicode.org/charts/About.html )

My Unicode expertise ends here. I don't know exactly what those illegal characters are. I can guess that characters with the Noncharacter_Code_Point property are illegal, and maybe there are more. I also don't know exactly what damage these characters cause if saved in the MediaWiki database, but I can guess that they may cause interoperability trouble with external tools: browsers, bots, search engines, future versions of the database engine, etc. They may also cause security breaches. So I think this is a warning sign, and most probably it shouldn't be possible to save pages that include such characters.
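
For illustration, here is a minimal Python sketch of how characters with the Noncharacter_Code_Point property could be detected in page text. It assumes the set defined by the Unicode Standard: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, 66 code points in total. The function names `is_noncharacter` and `find_noncharacters` are hypothetical; this is not MediaWiki code, and it says nothing about how MediaWiki should handle such characters on save.

```python
# Requires Python 3.9+ for the built-in generic type hints.

def is_noncharacter(cp: int) -> bool:
    """Return True if code point cp is a Unicode noncharacter."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0..U+FDEF: a contiguous range of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str) -> list[tuple[int, str]]:
    """Return (offset, 'U+XXXX') for each noncharacter found in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_noncharacter(ord(ch))
    ]


if __name__ == "__main__":
    sample = "abc\ufdd3def"  # contains the character from this report
    print(find_noncharacters(sample))  # [(3, 'U+FDD3')]
```

A check like this only reports where such characters occur. Whether to reject the edit, strip the characters, or replace them is a separate policy decision.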
Comment 1 Brion Vibber 2008-06-20 22:28:36 UTC
They're not technically illegal, but perhaps should be excluded as they wouldn't be useful.
