Last modified: 2010-05-15 15:33:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T2240, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 240 - Need to perform Unicode normalization


Summary:	Need to perform Unicode normalization

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	General/Unknown
Version:	1.4.x
Hardware:	All All

Importance:	Normal normal with 1 vote
Target Milestone:	---
Assigned To:	Brion Vibber

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	html 215

Reported:	2004-08-28 18:14 UTC by Brion Vibber
Modified:	2010-05-15 15:33 UTC
CC List:	4 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Brion Vibber 2004-08-28 18:14:26 UTC

We need to ensure that UTF-8 input is:
* Valid UTF-8 (strip broken chars)
* Valid for XML output (strip illegal control characters)
* In sensible normalization (form C)

In some cases we may need to normalize on output as well, due to old
data being corrupt. Or, we can do a one-time pass on the database
to clean it up.
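
The three requirements listed above can be sketched as a single input-cleaning step. This is only a minimal illustration in Python, not the MediaWiki code: the helper name `clean_input` is hypothetical, invalid bytes are simply dropped (matching "strip broken chars"), and the set of characters treated as illegal is taken from the XML 1.0 `Char` production.

```python
import re
import unicodedata

# Characters that XML 1.0 does not allow: C0 controls other than tab,
# newline and carriage return, plus U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_input(raw: bytes) -> str:
    """Hypothetical helper: validate, strip, and normalize user input."""
    # 1. Drop byte sequences that are not valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")
    # 2. Remove characters that are illegal in XML output.
    text = _XML_ILLEGAL.sub("", text)
    # 3. Normalize to form C (canonical composition).
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent becomes the single code point U+00E9.
print(clean_input(b"cafe\xcc\x81") == "caf\u00e9")  # True
```

Running the same function over stored text would be one way to approach the one-time database cleanup mentioned above, assuming the stored data is reachable as raw bytes.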

Comment 1 Brion Vibber 2004-08-28 18:16:29 UTC

I'm working on a 'pure PHP' normalizer, though it's likely to be relatively slow. If it turns out to be too slow, we may want a DSO extension for high-performance sites (see comments in bug 215 on some external resources).

Comment 2 Brion Vibber 2004-08-29 10:34:34 UTC

I've checked in some more or less functional normalization routines, in includes/normal.

Probably will want to have WebRequest call UtfNormal::toNFC() on input, or at least some input.
And/or put it in title/username normalization.

Additionally we'll want to check for broken UTF-8; these routines can probably be extended to do that too.
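
To illustrate why title/username normalization matters: two strings can render identically yet compare unequal until both are brought to form C. The sketch below uses Python's standard `unicodedata` module rather than `UtfNormal::toNFC()`, and `normalize_title` is a hypothetical stand-in, not an actual MediaWiki function.

```python
import unicodedata

composed = "Z\u00fcrich"      # "ü" as a single code point, U+00FC
decomposed = "Zu\u0308rich"   # "u" followed by combining diaeresis, U+0308


def normalize_title(title: str) -> str:
    """Hypothetical stand-in for a title normalization step."""
    return unicodedata.normalize("NFC", title)


print(composed == decomposed)                                    # False
print(normalize_title(composed) == normalize_title(decomposed))  # True
```

Without this step, the two spellings could end up as distinct page titles or usernames even though readers cannot tell them apart.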

Comment 3 Brion Vibber 2004-09-03 07:16:34 UTC

Now checking all(?) input for broken UTF-8 and normalizing to form C on input. Could use optimization but that's a separate issue. :)

Checked into CVS for 1.4.
