Last modified: 2010-05-15 15:33:55 UTC
We need to ensure that UTF-8 input is:
* Valid UTF-8 (strip broken chars)
* Valid for XML output (strip illegal control characters)
* In sensible normalization (form C)
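
The three requirements above can be sketched as a single cleanup pass. This is only an illustration in Python, not the PHP code in includes/normal; the names `is_xml_char` and `clean_input` are made up for the example, and the set of legal characters is assumed to be the XML 1.0 `Char` production.

```python
import unicodedata


def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production (assumed target)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def clean_input(raw: bytes) -> str:
    """Hypothetical helper: strip broken UTF-8, strip XML-illegal chars, apply NFC."""
    # 1. Drop byte sequences that are not valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")
    # 2. Drop characters that cannot appear in XML output (most C0 controls).
    text = "".join(ch for ch in text if is_xml_char(ch))
    # 3. Normalize to form C (canonical composition).
    return unicodedata.normalize("NFC", text)
```

For example, `clean_input(b"caf\xff e\xcc\x81\x00")` drops the stray `0xFF` byte and the NUL, and composes `e` + U+0301 into a single U+00E9.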
In some cases we may need to normalize on output as well, because some
old data is corrupt. Alternatively, we can do a one-time pass on the
database to clean it up.
I'm working on a 'pure PHP' normalizer, though it's likely to be relatively slow. If it turns out to be too slow, we may want a DSO extension for high-performance
sites (see comments in bug 215 on some external resources).
I've checked in some more or less functional normalization routines, in includes/normal.
We will probably want WebRequest to call UtfNormal::toNFC() on input, or at least on some input,
and/or put it in title/username normalization.
Additionally we'll want to check for broken UTF-8; these routines can probably be extended to do that too.
Now checking all(?) input for broken UTF-8 and normalizing to form C on input. Could use optimization but that's a
separate issue. :)
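
One possible optimization, sketched here in Python rather than PHP and not necessarily what the checked-in routines do: pure-ASCII text is already in form C, so it can skip the normalizer entirely. The name `to_nfc_fast` is hypothetical.

```python
import unicodedata


def to_nfc_fast(text: str) -> str:
    """Hypothetical fast path: ASCII-only strings are already NFC."""
    if text.isascii():
        # No combining marks or decomposable characters are possible here.
        return text
    # Fall back to full normalization for anything non-ASCII.
    return unicodedata.normalize("NFC", text)
```

Since most input on many sites is plain ASCII, a check like this would let the common case avoid the slow path.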
Checked into CVS for 1.4.