Last modified: 2013-07-25 07:05:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T48330, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46330 - Set $wgCategoryCollation to 'uca-fi' on Finnish wikis and rebuild category sort keys
Status: VERIFIED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 2 votes
Target Milestone: ---
Assigned To: Bartosz Dziewoński
Keywords: shell
Depends on:
Blocks: collations
Reported: 2013-03-19 18:34 UTC by Mikko Silvonen
Modified: 2013-07-25 07:05 UTC (History)

Description Mikko Silvonen 2013-03-19 18:34:15 UTC
Please set $wgCategoryCollation to 'uca-fi' and rebuild category sort keys on all Finnish wikis except Wiktionary, i.e. fi.wikipedia, fi.wikisource, fi.wikibooks, fi.wikiversity, fi.wikiquote and fi.wikinews.

Is this feature already mature enough to be just deployed, or should it be tested in advance?

Community discussions/notifications:

http://fi.wikipedia.org/wiki/Wikipedia:Kahvihuone_(tekniikka)#.C3.84.C3.A4kk.C3.B6set_vihdoin_oikeaan_j.C3.A4rjestykseen
http://fi.wikisource.org/wiki/Wikiaineisto:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikibooks.org/wiki/Wikikirjasto:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikiversity.org/wiki/Wikiopisto:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikiquote.org/wiki/Wikisitaatit:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikinews.org/wiki/Wikiuutiset:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29

The smaller projects are pretty quiet at the moment, so I may not receive any responses to such a no-brainer bug fix proposal, but the Wikipedia community is already becoming impatient and asking why this wasn't fixed years ago. :) Thank you in advance!
Comment 1 Bartosz Dziewoński 2013-03-19 22:18:27 UTC
This will probably have to wait a few days, since there are a couple of such configuration changes in progress or queued right now, and processing the pages for a semi-large wiki like fi.wikipedia can take multiple hours.

(In reply to comment #0)
> Is this feature already mature enough to be just deployed, or should it be
> tested in advance?

The ICU library used for the actual sorting here is mature and stable. However, strange interactions with the code in MediaWiki are not impossible, as seen in bug 45446 comment 6 (although they are unlikely). At any rate, I created a testwiki with these settings for you at http://users.v-lo.krakow.pl/~matmarex/testwiki-fi/ , feel free to link it on the wikis and edit there to see how it behaves (but be aware that the wiki won't stay up forever after this bug is closed).
Comment 2 Mikko Silvonen 2013-03-20 05:37:58 UTC
Thanks! There is a grouping problem in Finnish, too: Words starting with T are shown under the Northern Sami letter "Ŧ" instead of "T". This must be fixed before the deployment.

http://users.v-lo.krakow.pl/~matmarex/testwiki-fi/index.php?title=Luokka:Aakkosj%C3%A4rjestys

I'll create a more comprehensive test suite to check if there are any other problems. It would also be nice to know which standard the ICU implementation is supposed to comply with (my guess: SFS-EN 13710). There are a couple of slightly different standards.
Comment 3 Mikko Silvonen 2013-03-20 18:03:22 UTC
Two more problems: The test word "Žukov" is incorrectly shown under "Ʒ" and the word "Nguyen" under "Ŋ". Ž should be equivalent to Z, and Ng should of course be sorted under N.

I wonder if there is some fundamental flaw with the grouping of letters under these one-letter headers?
Comment 4 Bartosz Dziewoński 2013-03-20 19:19:19 UTC
(In reply to comment #2)
> It would also be nice to know which standard the ICU implementation
> is supposed to comply with (my guess: SFS-EN 13710). There are a couple of
> slightly different standards.

I have no idea, to be honest. Wikimedia wikis are currently running ICU 4.8 (per bug 46036); that's all the information I can give you :)

The data used to "partition" the sorted list into headers is probably not standardised at all and somehow based on the information about primary-level collation data. For details you should probably look at the code that generates it, maintenance/language/generateCollationData.php. 


(In reply to comment #3)
> I wonder if there is some fundamental flaw with the grouping of letters under
> these one-letter headers?

I don't think there's such a "fundamental flaw" in it; the list is generated using generalised data that's reasonably correct for most languages, and thus needs such modifications for some specific ones. For example, no modifications were needed for Portuguese, and Polish only required adding the appropriate letters with diacritics.

You and Swedes are just unlucky, I suppose :) It's interesting how those characters are sorted among Latin letters in Finnish, and at the end of the Latin alphabet in Polish or Portuguese.

I automatically created a category with all two-letter combinations of ASCII letters + Å, Ä, Ö: http://users.v-lo.krakow.pl/~matmarex/testwiki-fi/index.php?title=Luokka:Autotest . It seems like we need to exclude those four characters: Ǥ, Ŋ, Ŧ, Ʒ. I'll submit a patch to do this later today.
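
A rough sketch of why a stray header letter can take over a whole section, and why excluding it helps. This is a toy model, not MediaWiki's actual code: PRIMARY_FOLD, primary_key and first_letter are hypothetical names, and it assumes that each page is filed under the last header whose primary key is not greater than the page's own key, and that Ŧ and Ǥ share the primary key of T and G.

```python
from bisect import bisect_right

# Assumption: these letters differ from their base letter only at the
# secondary level, so at the primary level they look like the base letter.
PRIMARY_FOLD = {"Ŧ": "T", "Ǥ": "G"}


def primary_key(text):
    """Toy primary-level key: uppercase and fold secondary differences away."""
    return "".join(PRIMARY_FOLD.get(ch, ch) for ch in text.upper())


def first_letter(word, headers):
    """Return the last header whose primary key is <= the word's key."""
    # sorted() is stable, so a header with the same key as an earlier one
    # (Ŧ after T) stays behind it and wins the lookup below.
    ordered = sorted(headers, key=primary_key)
    keys = [primary_key(h) for h in ordered]
    idx = bisect_right(keys, primary_key(word)) - 1
    return ordered[idx] if idx >= 0 else None


with_stroke = first_letter("Tampere", ["S", "T", "Ŧ", "U"])
without_stroke = first_letter("Tampere", ["S", "T", "U"])
print(with_stroke, without_stroke)
```

With Ŧ in the header list the toy lookup files "Tampere" under Ŧ; with Ŧ excluded it falls back to T, which is the effect the exclusion is meant to have.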
Comment 5 Bartosz Dziewoński 2013-03-20 23:19:21 UTC
Submitted the patch: I976dedfd and deployed it on my test wiki (you might need to action=purge the category pages to see it).
Comment 6 Bawolff (Brian Wolff) 2013-03-21 00:09:56 UTC
That's kind of weird. "Ŧ" should be primary different from T (according to a chart for ICU 4.2 [1], maybe it changed in later versions) which means that they should each have their own section with things starting with T being labelled under T.

In comparison, in Swedish the issue was with expansions - note the dark grey background of thorn in [2]

[1] http://collation-charts.org/icu442/icu442-fi.html

[2] http://collation-charts.org/icu442/icu442-fi.html
Comment 7 Bartosz Dziewoński 2013-03-21 00:13:14 UTC
(Note that Wikimedia wikis are currently running ICU 4.8 per bug 46036.)
Comment 8 Mikko Silvonen 2013-03-21 04:25:05 UTC
Thank you! The grouping looks good in the test categories, and I haven't seen any problems with the underlying sort order.

According to SFS-EN 13710 (derived from EN 13710:2011), the first-level Latin letters are A...ZÞÅÄÖ in Finnish. Ŧ is defined as a second-level letter equivalent to T.
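
To illustrate the first-level order quoted above, here is a minimal sketch comparing plain code point order with a toy key built from A...ZÞÅÄÖ. It is not ICU and not the 'uca-fi' collation: FINNISH_FIRST_LEVEL, RANK and finnish_toy_key are hypothetical names, second-level equivalences are ignored, and the place names are just sample data.

```python
import unicodedata

# First-level Latin letters for Finnish as quoted above: A...Z, then Þ Å Ä Ö.
FINNISH_FIRST_LEVEL = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00de\u00c5\u00c4\u00d6"
RANK = {ch: i for i, ch in enumerate(FINNISH_FIRST_LEVEL)}


def finnish_toy_key(word):
    """Toy key: rank of each uppercased letter; unknown characters sort last."""
    normalized = unicodedata.normalize("NFC", word).upper()
    return [RANK.get(ch, len(RANK)) for ch in normalized]


# Sample data (assumption), normalized so each letter is one code point.
names = [unicodedata.normalize("NFC", n)
         for n in ["Örebro", "Ängelholm", "Åmål", "Ystad"]]

by_code_point = sorted(names)                    # Ä (U+00C4) before Å (U+00C5)
by_finnish = sorted(names, key=finnish_toy_key)  # Å before Ä, then Ö

print(by_code_point)
print(by_finnish)
```

Code point order puts Ängelholm before Åmål, while the toy Finnish key puts Åmål first, which is the ÄÅÖ versus ÅÄÖ difference this request is about.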
Comment 9 Mikko Silvonen 2013-03-21 05:37:20 UTC
Some "exotic" characters (e.g. Ƕ, Ə and Ƭ) are still treated as first-level letters, but this could be a feature of the ICU library. EN 13710:2011 defines these three characters as second-level letters equivalent to HV, E and T.

I don't see this as a release blocker.
Comment 10 Bawolff (Brian Wolff) 2013-03-23 20:48:45 UTC
(In reply to comment #6)
> That's kind of weird. "Ŧ" should be primary different from T (according to a
> chart for ICU 4.2 [1], maybe it changed in later versions) which means that
> they should each have their own section with things starting with T being
> labelled under T.
> 
> In comparison, in Swedish the issue was with expansions - note the dark grey
> background of thorn in [2]
> 
> [1] http://collation-charts.org/icu442/icu442-fi.html
> 
> [2] http://collation-charts.org/icu442/icu442-fi.html

I think I figured out what was happening.

Ŧ is tailored to be secondary different from T̵ (i.e. T plus U+0335 COMBINING SHORT STROKE OVERLAY). U+0335 should be primary ignorable, so in essence Ŧ is secondary different from plain T. Since T̵ is 2 characters, it's like an expansion, which our primary collision code doesn't handle properly.
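
For anyone who wants to inspect the characters involved, here is a small sketch using Python's standard unicodedata module. It only looks at Unicode character data, not at ICU's collation rules, and the variable names are made up for illustration. It shows that T̵ is a two-character sequence, and that Unicode does not canonically decompose Ŧ into it, so the relation described above would have to come from the collation tailoring rather than from normalization.

```python
import unicodedata

t_with_stroke = "\u0166"    # Ŧ, a single precomposed letter
t_plus_overlay = "T\u0335"  # T followed by the combining overlay

# Character names of each piece.
stroke_name = unicodedata.name(t_with_stroke)
overlay_names = [unicodedata.name(ch) for ch in t_plus_overlay]

# Ŧ has no canonical decomposition, so NFD leaves it untouched.
stroke_decomposition = unicodedata.decomposition(t_with_stroke)
nfd_unchanged = unicodedata.normalize("NFD", t_with_stroke) == t_with_stroke

# For contrast, Ž does decompose canonically into Z + combining caron.
z_caron_decomposition = unicodedata.decomposition("\u017d")

print(stroke_name)
print(overlay_names)
print(repr(stroke_decomposition), nfd_unchanged)
print(z_caron_decomposition)
```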
Comment 11 Bawolff (Brian Wolff) 2013-03-23 21:48:47 UTC
(In reply to comment #5)
> Submitted the patch: I976dedfd and deployed it on my test wiki (you might need
> to action=purge the category pages to see it).

btw, now merged.
Comment 12 Mikko Silvonen 2013-04-02 17:43:54 UTC
When can we deploy this? I'd like to notify the Finnish community about the schedule.
Comment 13 Bartosz Dziewoński 2013-04-02 18:23:34 UTC
I just noticed there is also https://fi.wikimedia.org/wiki/Etusivu - I assume it should be covered by the change as well?
Comment 14 Bartosz Dziewoński 2013-04-02 18:27:45 UTC
Submitted a patch including fiwikimedia as Ia40f5b89. I'm a volunteer myself, so I can't tell you when it will be deployed - likely within a week or so, probably quicker.
Comment 15 Mikko Silvonen 2013-04-03 05:07:44 UTC
Thank you! Yes, Wikimedia Finland should be included, although this particular site might never have content affected by this bug. (Swedish names starting with Å have been the biggest problem with the old sort order.)
Comment 16 Sam Reed (reedy) 2013-04-03 12:55:55 UTC
All done
Comment 17 Mikko Silvonen 2013-04-03 13:45:21 UTC
Was the patch mentioned in comment 5 included?

When I view the page http://fi.wikipedia.org/wiki/Luokka:Ruotsin_kaupungit , the letters Å, Ä and Ö are now in the correct order, but the G, N and T sections are incorrectly labelled as Ǥ, Ŋ and Ŧ.
Comment 18 Mikko Silvonen 2013-04-03 16:41:36 UTC
Reopening until the single-letter headings are displayed correctly.
Comment 19 Bartosz Dziewoński 2013-04-03 16:44:03 UTC
(In reply to comment #17)
> Was the patch mentioned in comment 5 included?

... it wasn't. Sorry, that was a stupid oversight :) The backport to 1.21wmf12 is I976dedfd, Reedy is working to get it deployed.
Comment 20 Alex Monk 2013-04-04 16:18:35 UTC
Looks like this is done now, marking as resolved fixed.
Comment 21 Mikko Silvonen 2013-04-10 11:50:42 UTC
I thank you, good people. The categories look good, and I haven't seen any complaints from any project (just checked the discussion threads). Marking as verified.

The Finnish Wiktionary community is still discussing their sorting needs and might submit a new request later:
http://fi.wiktionary.org/wiki/Wikisanakirja:Kahvihuone#Wikisanakirjan_aakkosj.C3.A4rjestys
