Last modified: 2014-09-24 01:23:45 UTC
This is a proposal for replacing the current implementation of UtfNormal with a faster one. The class is meant as a drop-in replacement and no change is required to MediaWiki's files except for some test files relying on private methods. (Utf8Test.php, see below) The methods implemented are: - cleanup - toNFC - toNFKC - toNFD - toNFKD To the best of my knowledge, the only functionnal difference with current implementation is that all the methods checks for the well-formedness of input and replace ill-formed UTF with replacement chars (on output, the original string is always left untouched). The code, which was developped on PHP 5.1.2 then tested on 4.4.2, passes both UtfNormalTest.php's and Utf8Test.php's tests. I don't have access to the utfnormal PHP extension, therefore I could only rely on original comments to anticipate its behaviour. (see "UnicodeString constructor" in file's comments) Note that Utf8Test.php needs to be edited in order to work with this new implementation, so here is an informal diff: - $stripped = $line; - UtfNormal::quickisNFCVerify( $stripped ); + $stripped = UtfNormal::toNFC( $line ); The expected performance improvement is between 5% (when using the utfnormal PHP extension) and somewhere above 1000% (without the PHP extension) for some edge cases involving long ASCII text and some Unicode chars whose NFC_QC property value is "Maybe" or "No". For the record, during testing some texts (taken from a dump of the fr wikipedia) were benchmarked at 80x the original speed. :) Note that UtfNormalGenerate.php must be run to regenerate the data files before use.
Created attachment 1455 [details] UtfNormal.php
Created attachment 1456 [details] UtfNormalGenerate.php
Lost in the mists of time. :( Will take another look over this! :D
This code still works (with very minor modifications). I've attached a patch that should apply cleanly on the 1.17 branch (and most probably on trunk too). The claim of ~5% improvement in time also still checks out (I didn't check with the php extension disabled though). Crude kcachegrind screenshots of a profiled run of dumpBackup.php with and without the patch here[1]. make test passes. [1]: http://imgur.com/jxDxX&fVF35l
Created attachment 8300 [details] Patch to make it apply cleanly 5 years after original was written
Comment on attachment 8300 [details] Patch to make it apply cleanly 5 years after original was written Contains the autogenerated data file as well.
Created attachment 8308 [details] Smaller, Cleaner Partial Diff Apply, then copy https://bugzilla.wikimedia.org/attachment.cgi?id=1456 and https://bugzilla.wikimedia.org/attachment.cgi?id=1455 to includes/normal.
apergos (Ariel) ran two medium sized dumps (~5 hours on wmf clusters) and found a ~3% performance improvement with the new UtfNormal.
Testing without php5-intl on a smallish dump (~300000 revs) gives around 8.5x improvement.
Created attachment 8325 [details] Patch with all defines moved to UtfNormalDefines.php Removed a bunch of warnings.
Created attachment 8326 [details] Clean Patch, without generated files.
Patch applied in r84706.
Reverted in r84842, throws enormous numbers of E_STRICT errors on TranslateWiki. Please see comments on CodeReview for r84706 and r84709.
Created attachment 8342 [details] Patch to fix issues mentioned in CR Fixes issues mentioned in http://www.mediawiki.org/wiki/Special:Code/MediaWiki/84706
Created attachment 8343 [details] Patch to fix issues mentioned in CR Forgot to add documentation in previous patch.
Created attachment 8353 [details] Patch to fix issues mentioned in CR Re-added Brion's copyright header, as per Reedy's suggestion.
Yuvi, I'm sorry no one reviewed your patch in the last several months. Sadly, it no longer applies cleanly to trunk. Do you possibly have time to update it? Since you now have commit access, you could just commit your updated patch into SVN.
@Sumana: I think consensus when this was 'abandoned' was that this was a third-hand patch (someone wrote it originally, I just cleaned it up, and then someone else would have to review/commit this). And that made it too untrustworthy/fragile in terms of quality to be accepted. I doubt that has changed?
Should I WONTFIX this? Or leave it open for some other brave soul?