Last modified: 2013-04-22 16:16:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T44412, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 42412 - Implement locale-specific sorting for Polish language
Implement locale-specific sorting for Polish language
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Categories (Other open bugs)
1.21.x
All All
: Normal normal (vote)
: ---
Assigned To: Bartosz Dziewoński
: i18n
Depends on:
Blocks: 30673 42413 45968
  Show dependency treegraph
 
Reported: 2012-11-24 15:24 UTC by Bartosz Dziewoński
Modified: 2013-04-22 16:16 UTC (History)
9 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
PolishCollation class (1.41 KB, text/plain)
2012-11-24 15:24 UTC, Bartosz Dziewoński
Details

Description Bartosz Dziewoński 2012-11-24 15:24:09 UTC
Created attachment 11415 [details]
PolishCollation class

Polish Wikipedia needs correct Polish language sorting.

The IcuCollation class (that can be enabled using $wgCategoryCollation = 'uca-default') doesn't cut it – regular ASCII letters and their variants with diacritics are treated as the same ones in sorting, and are displayed under the ASCII heading together, which is incorrect in Polish.

The full Polish alphabet is: AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ, in this order, that is ABCDEFGHIJKLMNOPQRSTUVWXYZ + ĄĆĘŁŃÓŚŹŻ - QVX.

IcuCollation doesn't seem to accept a 'pl' or 'pl_PL' argument, so I have implemented a small class deriving from IcuCollation (attached), but since it looks like I'm the first one to try something like this, somebody with more MW experience will have to figure out where to stick it in the code – right in includes/Collation.php or maybe in an extension? Also, I have only tested it with Latin-based alphabets – it *should* work just right with other ones, but you guys might want to test this.

Additionally, for a reason that is to me inexplicable, IcuCollation, as well as my derivation, insists on sorting articles starting with "A" under "⅍". I have no idea why this happens or how to change it properly.

Here's a testwiki where I implemented it, and two accordingly sorted categories:

* http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Test
** This one simply contains some bot-generated articles and some weird tests. It's a good basic testcase, showing interactions between ASCII characters, Polish diacritics and other non-ASCII chars (I used some German umlauted letters).

* http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Polscy_aktorzy_filmowi
** This is simply one full category imported from pl.wikipedia, containing Polish movie actors. On http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Polscy_aktorzy_filmowi&pageuntil=Machnicki%2C+Ireneusz%0AIreneusz+Machnicki one can see the behavior of "L" and "Ł".

As mentioned before, the collations class is attached; to make it work, I simply patched includes/Collation.php, adding `case 'polish': return new PolishCollation( 'root' );` to Collation::factory, setting $wgLanguageCode = "pl"; and $wgCategoryCollation = 'polish';.

I'd greatly appreciate feedback, or maybe a simpler, less hacky solution :)

(See also: bug 29788, the same thing for Swedish)
Comment 1 Bawolff (Brian Wolff) 2012-11-24 19:49:09 UTC
Hi. Thanks for the patch. [I don't have time to look at it right now, but hopefully me or someone else will shortly].

>Additionally, for a reason that is to me inexplicable, IcuCollation, as well as
>my derivation, insists on sorting articles starting with "A" under "⅍". I have
>no idea why this happens or how to change it properly.

This is a guess, but: For the IcuCollation, there is a first-letter.ser (named something like that) file which determines how things are sorted under first letter. This file needs to be kept consist with the collation. Presumably you're collation is using that file, when a new one needs to be generated for it (There's scripts to generate the file in the maintinance sub directory. Off the top of my head I'm unsure how tied they are to ICUCollation or if they would work directly with yours).
Comment 2 Bartosz Dziewoński 2013-01-08 17:46:08 UTC
I just noticed that my patch has a pretty stupid bug – it only handles first letters of words for sorting, instead of every letter. Some unicode-aware preg_replace would fix this.
Comment 3 Bawolff (Brian Wolff) 2013-01-08 19:39:44 UTC
>Additionally, for a reason that is to me inexplicable, IcuCollation, as well as
>my derivation, insists on sorting articles starting with "A" under "⅍". I have
>no idea why this happens or how to change it properly.

btw, for anyone else following along - that issue is split to bug 43740
-----
Actual patch:

I don't think this is the way that extending uca-default was envisioned of working. I believe the intention was more to use intl's built in support for different language collations [via new IcuCollation( 'pl' ) ] and add a first-letters-pl.ser file. This would probably integrate the sorting of polish letters with other types of letters more seamlessly. However, how to generate a first-letters-pl.ser file is a bit of an open question at the moment, and probably requires a much expanded maintenance/languages/generateCollationData.php file. [Although I have a vague idea how to make one that would imperfectly, but probably acceptably by hand]

----

With the actual patch, the getFirstLetter takes a binary string. There's no guarantee that the icu collation won't use a binary code that is also a code point for one of the "polish" letters (in practise that's probably rare though). Additionally since its a binary string, its not guaranteed to be valid UTF-8, and I'm not sure how mb_substr would handle invalid utf-8.
Comment 4 Bartosz Dziewoński 2013-01-08 20:54:49 UTC
(In reply to comment #3)
> I believe the intention was more to use intl's built in support for
> different language collations [via new IcuCollation( 'pl' ) ] and add a
> first-letters-pl.ser file. 

As noted in the zeroth comment, IcuCollation doesn't seem to accept a 'pl' or 'pl_PL' argument for me. I tried that first, of course.

Does either of these work for you? Maybe I just need more PHP libs...
Comment 5 Bartosz Dziewoński 2013-01-08 21:30:39 UTC
(In reply to comment #4)
> As noted in the zeroth comment, IcuCollation doesn't seem to accept a 'pl' or
> 'pl_PL' argument for me. I tried that first, of course.

Per out IRC discussion, it turns out it actually does, but it also needs the .ser file with a 'pl' suffix. This makes the attachment essentially useless.
Comment 6 Bawolff (Brian Wolff) 2013-01-10 20:10:03 UTC
I just realized I said some incorrect things in comment 3.

(In reply to comment #3)
> 
> With the actual patch, the getFirstLetter takes a binary string. There's no
> guarantee that the icu collation won't use a binary code that is also a code
> point for one of the "polish" letters (in practise that's probably rare
> though). Additionally since its a binary string, its not guaranteed to be
> valid
> UTF-8, and I'm not sure how mb_substr would handle invalid utf-8.

This is all incorrect. getFirstLetter takes the human readable sortkey.

>This makes the attachment essentially useless.

Not entirely useless, after all the patch works now, where we don't have .ser files for pl yet. But Ideally we'd be doing this with the .ser file.
Comment 7 Bartosz Dziewoński 2013-02-18 22:51:46 UTC
So the patch here is crap, but I submitted I838484b9 to fix this properly.
Comment 8 Bartosz Dziewoński 2013-02-26 20:12:39 UTC
Merged.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links