Last modified: 2014-09-24 01:23:45 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T7303, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 5303 - UtfNormal replacement proposal
UtfNormal replacement proposal
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
All All
: Low enhancement (vote)
: ---
Assigned To: Nobody - You can work on this!
: patch, patch-reviewed
Depends on:
  Show dependency treegraph
Reported: 2006-03-21 02:06 UTC by Ludovic ARNAUD
Modified: 2014-09-24 01:23 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

UtfNormal.php (47.82 KB, text/x-php)
2006-03-21 02:07 UTC, Ludovic ARNAUD
UtfNormalGenerate.php (8.93 KB, text/x-php)
2006-03-21 02:07 UTC, Ludovic ARNAUD
Patch to make it apply cleanly 5 years after original was written (478.55 KB, patch)
2011-03-16 18:43 UTC, Yuvi Panda
Smaller, Cleaner Partial Diff (2.46 KB, patch)
2011-03-17 19:01 UTC, Yuvi Panda
Patch with all defines moved to UtfNormalDefines.php (478.78 KB, patch)
2011-03-24 19:14 UTC, Yuvi Panda
Clean Patch, without generated files. (90.44 KB, patch)
2011-03-24 20:26 UTC, Yuvi Panda
Patch to fix issues mentioned in CR (90.43 KB, patch)
2011-03-28 21:25 UTC, Yuvi Panda
Patch to fix issues mentioned in CR (91.38 KB, patch)
2011-03-28 21:35 UTC, Yuvi Panda
Patch to fix issues mentioned in CR (91.48 KB, patch)
2011-03-30 21:27 UTC, Yuvi Panda

Description Ludovic ARNAUD 2006-03-21 02:06:02 UTC
This is a proposal for replacing the current implementation of UtfNormal with a
faster one. The class is meant as a drop-in replacement and no change is
required to MediaWiki's files except for some test files relying on private
methods. (Utf8Test.php, see below)

The methods implemented are:
- cleanup
- toNFC
- toNFKC
- toNFD
- toNFKD

To the best of my knowledge, the only functionnal difference with current
implementation is that all the methods checks for the well-formedness of input
and replace ill-formed UTF with replacement chars (on output, the original
string is always left untouched).

The code, which was developped on PHP 5.1.2 then tested on 4.4.2, passes both
UtfNormalTest.php's and Utf8Test.php's tests. I don't have access to the
utfnormal PHP extension, therefore I could only rely on original comments to
anticipate its behaviour. (see "UnicodeString constructor" in file's comments)
Note that Utf8Test.php needs to be edited in order to work with this new
implementation, so here is an informal diff:
-       $stripped = $line;
-       UtfNormal::quickisNFCVerify( $stripped );
+       $stripped = UtfNormal::toNFC( $line );

The expected performance improvement is between 5% (when using the utfnormal PHP
extension) and somewhere above 1000% (without the PHP extension) for some edge
cases involving long ASCII text and some Unicode chars whose NFC_QC property
value is "Maybe" or "No". For the record, during testing some texts (taken from
a dump of the fr wikipedia) were benchmarked at 80x the original speed. :)

Note that UtfNormalGenerate.php must be run to regenerate the data files before use.
Comment 1 Ludovic ARNAUD 2006-03-21 02:07:25 UTC
Created attachment 1455 [details]
Comment 2 Ludovic ARNAUD 2006-03-21 02:07:48 UTC
Created attachment 1456 [details]
Comment 3 Brion Vibber 2008-12-16 01:52:27 UTC
Lost in the mists of time. :(

Will take another look over this! :D
Comment 4 Yuvi Panda 2011-03-16 18:41:56 UTC
This code still works (with very minor modifications). I've attached a patch that should apply cleanly on the 1.17 branch (and most probably on trunk too). 

The claim of ~5% improvement in time also still checks out (I didn't check with the php extension disabled though). Crude kcachegrind screenshots of a profiled run of dumpBackup.php with and without the patch here[1].

make test passes.

Comment 5 Yuvi Panda 2011-03-16 18:43:43 UTC
Created attachment 8300 [details]
Patch to make it apply cleanly 5 years after original was written
Comment 6 Yuvi Panda 2011-03-16 18:47:51 UTC
Comment on attachment 8300 [details]
Patch to make it apply cleanly 5 years after original was written

Contains the autogenerated data file as well.
Comment 7 Yuvi Panda 2011-03-17 19:01:06 UTC
Created attachment 8308 [details]
Smaller, Cleaner Partial Diff

Apply, then copy and to includes/normal.
Comment 8 Yuvi Panda 2011-03-23 20:12:50 UTC
apergos (Ariel) ran two medium sized dumps (~5 hours on wmf clusters) and found a ~3% performance improvement with the new UtfNormal.
Comment 9 Yuvi Panda 2011-03-24 05:27:42 UTC
Testing without php5-intl on a smallish dump (~300000 revs) gives around 8.5x improvement.
Comment 10 Yuvi Panda 2011-03-24 19:14:33 UTC
Created attachment 8325 [details]
Patch with all defines moved to UtfNormalDefines.php

Removed a bunch of warnings.
Comment 11 Yuvi Panda 2011-03-24 20:26:43 UTC
Created attachment 8326 [details]
Clean Patch, without generated files.
Comment 12 Happy-melon 2011-03-24 20:53:23 UTC
Patch applied in r84706.
Comment 13 Happy-melon 2011-03-27 12:23:53 UTC
Reverted in r84842, throws enormous numbers of E_STRICT errors on TranslateWiki.  Please see comments on CodeReview for r84706 and r84709.
Comment 14 Yuvi Panda 2011-03-28 21:25:34 UTC
Created attachment 8342 [details]
Patch to fix issues mentioned in CR

Fixes issues mentioned in
Comment 15 Yuvi Panda 2011-03-28 21:35:33 UTC
Created attachment 8343 [details]
Patch to fix issues mentioned in CR

Forgot to add documentation in previous patch.
Comment 16 Yuvi Panda 2011-03-30 21:27:21 UTC
Created attachment 8353 [details]
Patch to fix issues mentioned in CR

Re-added Brion's copyright header, as per Reedy's suggestion.
Comment 17 Sumana Harihareswara 2011-11-06 14:13:08 UTC
Yuvi, I'm sorry no one reviewed your patch in the last several months.  Sadly, it no longer applies cleanly to trunk.  Do you possibly have time to update it?  Since you now have commit access, you could just commit your updated patch into SVN.
Comment 18 Yuvi Panda 2011-11-10 18:38:58 UTC
@Sumana: I think consensus when this was 'abandoned' was that this was a third-hand patch (someone wrote it originally, I just cleaned it up, and then someone else would have to review/commit this). And that made it too untrustworthy/fragile in terms of quality to be accepted. I doubt that has changed?
Comment 19 Yuvi Panda 2013-04-22 19:55:11 UTC
Should I WONTFIX this? Or leave it open for some other brave soul?

Note You need to log in before you can comment on or make changes to this bug.