Last modified: 2010-05-15 15:33:55 UTC
We need to ensure that UTF-8 input is:
* Valid UTF-8 (strip broken chars)
* Valid for XML output (strip illegal control characters)
* In sensible normalization (form C)

In some cases we may need to normalize on output as well, due to old data being corrupt. Or, we can do a one-time pass on the database to clean it up.
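The three input requirements listed above can be sketched as a single cleaning pass. This is only an illustration in Python, not the PHP code discussed in this bug; the helper name `clean_utf8` is hypothetical, and dropping undecodable bytes (rather than substituting U+FFFD) is an assumption.

```python
import re
import unicodedata

# Characters not allowed in XML 1.0: C0 controls other than tab, LF, CR,
# plus the noncharacters U+FFFE and U+FFFF. Lone surrogates cannot survive
# the UTF-8 decode step below, so they need no separate handling.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_utf8(raw: bytes) -> str:
    """Hypothetical helper: decode, strip, and normalize untrusted input."""
    # 1. Valid UTF-8: drop byte sequences that do not decode.
    text = raw.decode("utf-8", errors="ignore")
    # 2. Valid for XML output: strip illegal control characters.
    text = _XML_ILLEGAL.sub("", text)
    # 3. Sensible normalization: compose to form C.
    return unicodedata.normalize("NFC", text)
```

For example, `clean_utf8(b"abc\xff")` drops the stray `0xFF` byte, and an "e" followed by a combining acute accent (U+0301) comes back as the single precomposed character U+00E9.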
I'm working on a 'pure PHP' normalizer, though it's likely to be relatively slow. If it turns out to be too slow, we may want a DSO extension for high-performance sites (see the comments on bug 215 for some external resources).
I've checked in some more-or-less functional normalization routines in includes/normal. We'll probably want WebRequest to call UtfNormal::toNFC() on input, or at least on some input, and/or put the call in title/username normalization. Additionally, we'll want to check for broken UTF-8; these routines can probably be extended to do that too.
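One way to apply normalization at the request layer is to walk every incoming parameter, including nested arrays, before the rest of the code sees it. The following is a rough Python sketch of that idea, not the actual WebRequest or UtfNormal code; `normalize_params` is a hypothetical name, and replacing undecodable bytes with U+FFFD is an assumption.

```python
import unicodedata


def normalize_params(value):
    """Hypothetical helper: recursively clean request parameters."""
    if isinstance(value, bytes):
        # Undecodable bytes become U+FFFD instead of raising an error.
        value = value.decode("utf-8", errors="replace")
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        # Keys are cleaned too, since field names are also user-supplied.
        return {normalize_params(k): normalize_params(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize_params(v) for v in value]
    return value  # numbers, None, etc. pass through unchanged
```

Doing this once at the entry point means title and username handling further down can assume form C, though they may still want their own checks for data that did not arrive through a request.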
Now checking all(?) input for broken UTF-8 and normalizing to form C on input. Could use optimization but that's a separate issue. :) Checked into CVS for 1.4.