Last modified: 2010-03-28 03:10:55 UTC
MediaWiki converts to Unicode "normal form C" (NFC) on input, but Sanitizer::decodeCharReferences() does not necessarily return NFC. A link like "[[Ω]]" generates a Title object which points to the non-NFC character in question (U+2126), and will be a red link, but due to comprehensive NFC conversion on input, clicking the red link will take you to the edit page of U+03C9. I suggest normalising the output of Sanitizer::decodeCharReferences(), assuming that can be done efficiently. Note that Title::newFromText() is quite hot, performance-wise, for some callers. This was reported on the English Wikipedia's village pump by [[User:Caerwine]], who does not wish to create a bugzilla account.
My impression is that sticking normalization on all decodes could be pretty slow, however if we only need to normalize *when something gets expanded* it could be made relatively efficient... In theory we could optimize by only applying normalization on the individual bits that are expanded -- but we also need the preceding char(s) to deal with combining characters, which doesn't play nicely with the way it's currently implemented (preg_replace callbacks on individual char reference sequences). The ASCII breakdowns in the normalizer mean that an unoptimized call would still look relatively efficient for English, but could be *enormously* slow for non-Latin languages, especially Korean. (Korean is extra pain because every hangul character has to be unpacked into jamo and repacked.) Adding Unicode tracking bug 3969.
Doesn't the normalizer fall back to php_normal.so if available? If that's acceptably fast, then as a quick fix, decodeCharReferences() could do normalization if that's available and not otherwise. (Does Wikimedia use that? It's mentioned in includes/normal/README.)
*** Bug 19451 has been marked as a duplicate of this bug. ***
&'s in links are incredibly rare (838/11370705 on enwiktionary, 248/7144150 on kowiktionary - approximately 0.005%) - a naive count which includes anything in a [[ ]], i.e. categories, images and interwikis, while excluding anything that a template might add. I have thus implemented (r64283) the only normalize *when something gets expanded* option from brion above, it is possible that additional checks could be added, but it seems likely they would slow down the 99.99% of cases where no expansion is needed.