Last modified: 2010-05-15 15:33:55 UTC

Bug 240 - Need to perform Unicode normalization
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Hardware: All All
Importance: Normal normal with 1 vote
Target Milestone: ---
Assigned To: Brion Vibber
Depends on:
Blocks: html 215
Reported: 2004-08-28 18:14 UTC by Brion Vibber
Modified: 2010-05-15 15:33 UTC (History)
CC List: 4 users


Description Brion Vibber 2004-08-28 18:14:26 UTC
We need to ensure that UTF-8 input is:
* Valid UTF-8 (strip broken chars)
* Valid for XML output (strip illegal control characters)
* In a sensible normalization form (form C)

In some cases we may need to normalize on output as well, due to old
data being corrupt. Or, we can do a one-time pass on the database
to clean it up.
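
The three input checks listed above can be sketched roughly as follows. This is an illustration in Python, not the PHP code discussed in this bug: the function name `clean_input` and the pattern `_XML_ILLEGAL` are hypothetical, and silently dropping invalid bytes (rather than, say, replacing them with U+FFFD) is an assumption based on the "strip broken chars" wording.

```python
import re
import unicodedata

# Characters that XML 1.0 does not allow in a document: C0 controls other
# than tab, LF and CR, plus the noncharacters U+FFFE and U+FFFF.
# (Surrogates are also illegal, but the UTF-8 decode step already rejects them.)
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_input(raw: bytes) -> str:
    """Hypothetical helper: sanitize raw request bytes into clean text."""
    # 1. Valid UTF-8: drop byte sequences that do not decode (assumption:
    #    strip them instead of substituting a replacement character).
    text = raw.decode("utf-8", errors="ignore")
    # 2. Valid for XML output: strip illegal control characters.
    text = _XML_ILLEGAL.sub("", text)
    # 3. Sensible normalization: convert to form C (NFC).
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 (combining acute) becomes the single code point U+00E9.
print(clean_input(b"cafe\xcc\x81") == "caf\u00e9")  # True
# A stray 0xFF byte and a NUL are both removed.
print(clean_input(b"a\xffb\x00c"))                  # abc
```

The same function could also be applied in a one-time pass over stored text, since running it on already-clean input leaves that input unchanged.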
Comment 1 Brion Vibber 2004-08-28 18:16:29 UTC
I'm working on a 'pure PHP' normalizer, though it's likely to be relatively slow. If it's too slow, we may want a DSO extension for high-performance sites (see the comments in bug 215 on some external resources).
Comment 2 Brion Vibber 2004-08-29 10:34:34 UTC
I've checked in some more or less functional normalization routines, in includes/normal.

Probably will want to have WebRequest call UtfNormal::toNFC() on input, or at least some input.
And/or put it in title/username normalization.
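
To illustrate why titles and usernames are candidates for this, here is a small Python sketch (not the PHP `UtfNormal` code itself; `normalize_name` is a hypothetical helper). Two strings that render identically can differ code point by code point until both are brought to form C.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical helper: bring a title or username to form C (NFC)."""
    return unicodedata.normalize("NFC", name)


composed = "Zo\u00eb"      # "ë" stored as one precomposed code point
decomposed = "Zoe\u0308"   # "e" followed by a combining diaeresis

# The raw strings look the same on screen but do not compare equal...
print(composed == decomposed)                                  # False
# ...whereas their normalized forms do.
print(normalize_name(composed) == normalize_name(decomposed))  # True
```

Without this step, the two spellings could end up as two distinct pages or accounts that look the same to a reader.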

Additionally we'll want to check for broken UTF-8; these routines can probably be extended to do that too.
Comment 3 Brion Vibber 2004-09-03 07:16:34 UTC
Now checking all(?) input for broken UTF-8 and normalizing to form C on input. Could use optimization, but that's a separate issue. :)

Checked into CVS for 1.4.
