Last modified: 2013-04-22 16:15:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T45740, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 43740 - IcuCollation doesn't prune first letter elements that duplicate a prefix of another first letter's sortkey
IcuCollation doesn't prune first letter elements that duplicate a prefix of a...
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Categories (Other open bugs)
1.21.x
All All
: High major (vote)
: 1.21.0 release
Assigned To: Nobody - You can work on this!
:
Depends on: 43801 43802 43804
Blocks: 30673
  Show dependency treegraph
 
Reported: 2013-01-08 19:09 UTC by Bawolff (Brian Wolff)
Modified: 2013-04-22 16:15 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Bawolff (Brian Wolff) 2013-01-08 19:09:08 UTC
Issue: The collation algorithm changes slightly between different versions of php's intl library. The first letters' algorithm we use to display the first letters on category pages is sensitive to these changes. Specifically the first-letters-root.ser file goes with a specific version of the icu library used with php intl extension 

Affect: Some of the category headers will be incorrect. Things may switch back and forth between different headers. For example "Aa" was sorted under a U+214D instead of the letter A in MatmaRex's test wiki.

----

What needs to be done:

*We need to get the version of php intl's extension in use. This will probably have to be done by grepping for ICU version in phpinfo()'s output

*We need to have versions of first-letters-root.ser that correspond to different versions of icu library. These are fairly easy to make (I believe). Simply need to feed maintinance/languages/generateCollationData.php different data files corresponding to different versions of unicode. (Need to double check that icu library is using the unicode files we think it is. I have a vague memory of it changing to some file from the CLDR stuff a couple versions ago)

*Some of the docs about which data files to download for generateCollationData.php need to be updated. Right now its telling people to download the latest version of one file, and the version corresponding to unicode 6.0 for another.
Comment 1 Bartosz Dziewoński 2013-01-09 23:20:11 UTC
I've split three sub-bugs based on your three points as piecemeal tasks: bug 43801, bug 43802 and bug 43804.

I'm hoping to work on them, starting with the easiest – doco one ;)
Comment 2 Bawolff (Brian Wolff) 2013-01-10 21:10:54 UTC
It looks kind of like the code in Collation.php

    // Primary collision
    // Keep whichever one sorts first in the main collator
    if ( $this->mainCollator->compare( $letter, $letterMap[$key] ) < 0 ) {
             $letterMap[$key] = $letter;
    }

Was meant to handle this situation (or I misunderstand the intention of the code)
Comment 3 Bawolff (Brian Wolff) 2013-01-11 03:26:52 UTC
(In reply to comment #2)
> It looks kind of like the code in Collation.php
> 
>     // Primary collision
>     // Keep whichever one sorts first in the main collator
>     if ( $this->mainCollator->compare( $letter, $letterMap[$key] ) < 0 ) {
>              $letterMap[$key] = $letter;
>     }
> 
> Was meant to handle this situation (or I misunderstand the intention of the
> code)

So reason that block of code didnt prevent the appearant primary key collisions caused by the icu version mismatch was because they weren't collisions.  The two cases I saw were a rupee header in the middle of the r section and an A/S symbol in the middle of the a's. Both cases the symbols would be expanded to have more than one primary weight-so they weren't exact collisions, only a prefix was equal
Comment 4 Tim Starling 2013-02-10 22:54:40 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > It looks kind of like the code in Collation.php
> > 
> >     // Primary collision
> >     // Keep whichever one sorts first in the main collator
> >     if ( $this->mainCollator->compare( $letter, $letterMap[$key] ) < 0 ) {
> >              $letterMap[$key] = $letter;
> >     }
> > 
> > Was meant to handle this situation (or I misunderstand the intention of the
> > code)
> 
> So reason that block of code didnt prevent the appearant primary key
> collisions
> caused by the icu version mismatch was because they weren't collisions.  The
> two cases I saw were a rupee header in the middle of the r section and an A/S
> symbol in the middle of the a's. Both cases the symbols would be expanded to
> have more than one primary weight-so they weren't exact collisions, only a
> prefix was equal

Yes, the preprocessing done on the first letter array in getFirstLetterData() was meant to defend against changes in ICU library version. I think the best solution to this bug would be to fix this particular case of symbols expanded to multiple primary weights, rather than to just abandon the whole design idea per the request on bug 43802. 

For the record, Matma Rex was using Gentoo Hardened 4.5.4, ICU 49.1.1 and intl 1.1.0.
Comment 5 Bartosz Dziewoński 2013-02-10 23:08:49 UTC
(This can now be seen live at http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Test , but won't stay there forever. Feel free to edit that wiki, and please disregard the spambots.)
Comment 6 Bartosz Dziewoński 2013-02-11 12:19:31 UTC
(The same also happens on ICU 50.1, which is using Unicode 6.2. 49.1.1 is using Unicode 6.1.)
Comment 7 Bartosz Dziewoński 2013-02-20 21:55:26 UTC
My I5d2a4e7e that simply removed the file was -2'd by Tim: "Unless someone can tell me why the bug Bawolff describes in comment 3 really is impossible to fix, this is a -2 for me.".

The summary of this should probably be reworded, but I'll leave that to someone who knows what they are doing.

(Raising severity to major, as this kind of breaks the feature of using ICU to collate category elements.)
Comment 8 Bawolff (Brian Wolff) 2013-03-24 03:34:56 UTC
Ok, I read up on icu, after quite a bit of googling, this actually looks not that complicated. From what I gather (if I read the docs right, which is a very big if), there should be no two primary collation elements where one collation element in its entirety is a prefix of some other collation. See https://ssl.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm specificly:
 R2. A fractional weight cannot exactly match the initial bytes of another fractional weight at the same level.

So assuming nothing funky is done to compress the sort keys (which I don't happens on the primary level, at least not currently), just looking for matching prefixes should work.

Anyhow Gerrit change #55503. I still need to double check that unsetting an element of an array doesn't modify its sorted order (It doesn't seem to, but should double check).

It also might be prettier if the check for duplicate prefix was merged into the general duplicate check, but I didn't see an easy way of doing.

Anyhow, feedback appreciated.
Comment 9 Nemo 2013-04-01 14:41:45 UTC
Setting milestone and raising priority per https://www.mediawiki.org/w/index.php?title=MediaWiki_1.21&diff=666996&oldid=666981
Comment 10 Bartosz Dziewoński 2013-04-08 23:11:09 UTC
Yay, I4bd3d39e merged. Needs backporting to 1.21 now - CC-ing Mark.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links