Last modified: 2010-03-28 03:10:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T16952, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 14952 - Character reference link can generate unreachable non-NFC title
Product: MediaWiki
Classification: Unclassified
Parser (Other open bugs)
Hardware: All  OS: All
Priority: Normal  Severity: minor
Assigned To: Nobody - You can work on this!
Duplicates: 19451 (view as bug list)
Depends on:
Blocks: unicode
Reported: 2008-07-28 05:26 UTC by Tim Starling
Modified: 2010-03-28 03:10 UTC (History)
3 users

See Also:
Web browser: ---
Mobile Platform: ---
Huggle Beta Tester: ---


Description Tim Starling 2008-07-28 05:26:47 UTC
MediaWiki converts to Unicode "normal form C" (NFC) on input, but Sanitizer::decodeCharReferences() does not necessarily return NFC. A link like "[[Ω]]" generates a Title object which points to the non-NFC character in question (U+2126), and will be a red link, but due to comprehensive NFC conversion on input, clicking the red link will take you to the edit page of U+03C9.

I suggest normalising the output of Sanitizer::decodeCharReferences(), assuming that can be done efficiently. Note that Title::newFromText() is quite hot, performance-wise, for some callers.

This was reported on the English Wikipedia's village pump by [[User:Caerwine]], who does not wish to create a bugzilla account.
Comment 1 Brion Vibber 2008-07-29 00:05:53 UTC
My impression is that sticking normalization on all decodes could be pretty slow; however, if we only need to normalize *when something gets expanded*, it could be made relatively efficient...

In theory we could optimize by only applying normalization on the individual bits that are expanded -- but we also need the preceding char(s) to deal with combining characters, which doesn't play nicely with the way it's currently implemented (preg_replace callbacks on individual char reference sequences).

The ASCII breakdowns in the normalizer mean that an unoptimized call would still look relatively efficient for English, but could be *enormously* slow for non-Latin languages, especially Korean. (Korean is extra pain because every hangul character has to be unpacked into jamo and repacked.)
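The Hangul cost can be seen directly: a normalization pass must algorithmically unpack each precomposed syllable into conjoining jamo and repack it. A small illustration in Python:

```python
import unicodedata

# NFD decomposes the syllable U+D55C into three jamo; NFC recomposes them.
syllable = "\ud55c"
jamo = unicodedata.normalize("NFD", syllable)

print([hex(ord(c)) for c in jamo])                      # ['0x1112', '0x1161', '0x11ab']
print(unicodedata.normalize("NFC", jamo) == syllable)   # True
```

Per-character work like this, over every character of a mostly-Hangul page title, is what makes an unconditional normalization pass expensive for Korean text.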

Adding Unicode tracking bug 3969.
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-07-29 19:58:58 UTC
Doesn't the normalizer fall back to the php_normal extension if available?  If that's acceptably fast, then as a quick fix, decodeCharReferences() could do normalization when that's available and not otherwise.  (Does Wikimedia use that?  It's mentioned in includes/normal/README.)
Comment 3 Conrad Irwin 2010-03-28 02:35:54 UTC
*** Bug 19451 has been marked as a duplicate of this bug. ***
Comment 4 Conrad Irwin 2010-03-28 03:10:55 UTC
Ampersands in links are incredibly rare (838 of 11,370,705 on enwiktionary, 248 of 7,144,150 on kowiktionary, approximately 0.005%). This is a naive count that includes anything inside [[ ]] (i.e. categories, images and interwikis) while excluding anything a template might add.

I have therefore implemented (r64283) the "only normalize when something gets expanded" option from Brion above. Additional checks could be added, but it seems likely they would slow down the 99.99% of cases where no expansion is needed.
