Last modified: 2014-11-18 18:07:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T47596, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 45596 - Set $wgCategoryCollation to 'uca-hu' on Hungarian Wikipedia and rebuild category sort keys
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 2 votes
Target Milestone: ---
Assigned To: Bartosz Dziewoński
Keywords: shell
Depends on: 46036
Blocks: collations
Reported: 2013-03-01 08:20 UTC by Tisza Gergő
Modified: 2014-11-18 18:07 UTC (History)

Description Tisza Gergő 2013-03-01 08:20:43 UTC
Per bug 45443, category collation should be localized on hu.wikipedia.
Comment 1 Tisza Gergő 2013-03-01 08:24:31 UTC
I reviewed IcuCollation::$tailoringFirstLetters['hu']; it contains exactly the non-ASCII letters that can be used for first-letter grouping in Hungarian.

Additionally, for all long-short vowel pairs (a – á, e – é, i – í, o – ó, ö – ő, u – ú, ü – ű) the long vowel should be treated as if it were the short one (e.g. the word "álom" should be listed under A, or the word "űr" under Ü). Since the Hungarian collation treats these pairs as equivalent, I suppose that is done automatically?
Comment 2 Tisza Gergő 2013-03-01 08:27:39 UTC
For reference, IcuCollation::$tailoringFirstLetters['hu'] contains this list:

"CS", "DZ", "DZS", "GY", "LY", "NY", "Ö", "SZ", "TY", "Ü", "ZS"
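
As a rough illustration of the first-letter grouping described in comments 1 and 2, here is a minimal Python sketch. It is not how MediaWiki does it (the real work is delegated to the ICU library); the names HU_MULTIGRAPHS, LONG_TO_SHORT and first_letter_group are hypothetical and exist only for this example. It assumes that a title is filed under the longest matching digraph or trigraph, and that long vowels are folded to their short counterparts.

```python
# Hypothetical sketch of Hungarian first-letter grouping.
# MediaWiki itself relies on ICU; this only mirrors the rules from the comments.

# Multi-letter headings taken from the list quoted in comment 2.
HU_MULTIGRAPHS = ["CS", "DZ", "DZS", "GY", "LY", "NY", "SZ", "TY", "ZS"]

# Long vowels are listed under the matching short vowel (comment 1).
LONG_TO_SHORT = str.maketrans("ÁÉÍÓŐÚŰ", "AEIOÖUÜ")


def first_letter_group(title: str) -> str:
    """Return the heading a title would be listed under (empty for '')."""
    upper = title.upper().translate(LONG_TO_SHORT)
    # Try longer multigraphs first so that "DZS" wins over "DZ".
    for graph in sorted(HU_MULTIGRAPHS, key=len, reverse=True):
        if upper.startswith(graph):
            return graph
    return upper[:1]
```

With these assumptions, first_letter_group("álom") gives "A" and first_letter_group("űr") gives "Ü", matching the examples in comment 1. A foreign name such as "Tycho" ends up under "TY", because nothing in this sketch can tell it apart from a Hungarian word.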
Comment 3 Bartosz Dziewoński 2013-03-01 10:49:30 UTC
I set up a testwiki in Hungarian with uca-hu collation enabled for you: http://users.v-lo.krakow.pl/~matmarex/testwiki-hu

Feel free to link it on-wiki and use it however you want, just be aware that it won't stay up forever after this bug is closed :)
Comment 4 Bartosz Dziewoński 2013-03-01 10:55:42 UTC
And I filled a test category with some letters and symbols: http://users.v-lo.krakow.pl/~matmarex/testwiki-hu/index.php?title=Kateg%C3%B3ria:Test

It seems to work correctly at first glance, but I don't speak Hungarian :)
Comment 5 Tisza Gergő 2013-03-02 14:41:10 UTC
Seems good to me, but I'll ask more knowledgeable people as well. Will there be a way to override the default placements? Foreign words should not be categorized under a digraph even if they are written the same way (e.g. the Tycho crater should be under T, not TY); there is of course no way to automate that, but something like {{#DEFAULTSORT:Tycho|T}} would be nice.
Comment 6 Bartosz Dziewoński 2013-03-02 15:30:48 UTC
(In reply to comment #5)
> Seems good to me, but I'll ask more knowledgeable people as well. Will there
> be a way to override the default placements? Foreign words should not be
> categorized under a digraph even if they are written the same way (e.g. the
> Tycho crater should be under T, not TY); there is of course no way to
> automate that, but something like {{#DEFAULTSORT:Tycho|T}} would be nice.

Good point, I didn't think of that. I tested it, and this seems possible by using a [[zero-width non-joiner]]: you could place a {{DEFAULTSORT:T&zwnj;ycho}} on the page with that name to force it to behave correctly. This forces the "t" and "y" to be considered separately, and the non-joiner itself has no effect during sorting. (See that test category again.)
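
To make the trick above concrete, here is a small Python sketch that builds such a sort key. It only shows where the zero-width non-joiner (U+200C) goes; how the collation treats it afterwards is up to ICU, as described above. The helper names split_digraph and defaultsort_tag are hypothetical and not part of MediaWiki.

```python
# Hypothetical helpers showing where the zero-width non-joiner is inserted.
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def split_digraph(title: str, position: int = 1) -> str:
    """Insert a ZWNJ after `position` characters so that the letters on
    either side are no longer adjacent (e.g. "T" and "y" in "Tycho")."""
    return title[:position] + ZWNJ + title[position:]


def defaultsort_tag(title: str) -> str:
    """Build the wikitext tag an editor would place on the page."""
    return "{{DEFAULTSORT:" + split_digraph(title) + "}}"
```

For "Tycho" the resulting key is six characters long and no longer starts with the two-character sequence "Ty", even though it looks unchanged when displayed.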

With Scribunto/Lua now being deployed, this could easily be made into a template, looking and behaving somewhat like [[Template:lowercase title]], so that editors wouldn't have to worry about the strange syntax.


----

Also, please hold a community discussion/vote on the Hungarian Wikipedia about this change, even if it's just a formality. I am not a WMF employee, but their policy is clear: a configuration change (especially one that is this disruptive) can only be made if there's obvious consensus. You can link the Hungarian testwiki I created there.

There's no hurry, especially since this change can only be made after MW 1.21wmf11 is deployed on March 13.

Here's a very similar vote/discussion I created on pl.wikipedia, regarding the same change, but for Polish: a short explanation, then votes and comments with yes/no icons.

https://pl.wikipedia.org/wiki/Wikipedia:PR#Zmiana_konfiguracji_.E2.80.93_w.C5.82.C4.85czenie_poprawnego_sortowania_artyku.C5.82.C3.B3w_na_stronach_kategorii
Comment 7 Samat 2013-03-02 16:54:36 UTC
Thank you for your effort! It will be a long-awaited (~9 years) bug fix on the Hungarian Wikipedia.
Comment 8 Tisza Gergő 2013-03-03 17:20:58 UTC
Thanks! &zwnj; looks like a good solution. Would it be possible to make the digraphs title case (that is, "Cs" instead of "CS")?
Comment 9 Bartosz Dziewoński 2013-03-05 09:50:13 UTC
Sorry for the late reply.

(In reply to comment #8)
> Would it be possible to make the
> digraphs title case (that is, "Cs" instead of "CS")?

Should be pretty easy to do. If that's how it's supposed to be done everywhere, I think we could titlecase the digraphs in IcuCollation::$tailoringFirstLetters['hu'] and it should "just work". I can do it if it's the proper solution.

And if that's only how hu.wiki wants this to look (and uppercased digraphs are correct in general), you could use a little CSS to uppercase the first letter and lowercase the rest:
    #mw-pages h3 { text-transform: lowercase; }
    #mw-pages h3::first-letter { text-transform: uppercase; }
Comment 10 Tisza Gergő 2013-03-09 16:30:52 UTC
Yes, as far as I am aware, it should always be done that way in Hungarian installations.
Comment 11 Bartosz Dziewoński 2013-03-09 18:11:26 UTC
I submitted Ie0ca297a to fix this (and deployed it on my testwiki).

Can you hold a little mini-vote (in the village pump, probably, see comment 6) to confirm that you, as the hu.wiki community, really do want this change? Just for the paper trail :)
Comment 12 Tisza Gergő 2013-03-09 18:28:59 UTC
I will start the on-wiki discussion shortly. A few more questions that came up:
- will it be harder to change the rules on the fly, if they turn out to be imperfect? I understand changing the collation is difficult because one has to reindex the whole table, but I suppose changing the first letters would be simpler.
- by the way, should we also check the collation itself? I have mostly collected input on the first letter grouping until now.
- will it be possible to create custom groups? (e.g. someone suggested using a "Numbers" group; having separate groups for all digits looks a bit silly)
- what is the logic for non-Hungarian characters? Accented Latin characters seem to be ordered as if the accents were stripped, which is good, but it would be nice to see the rules spelled out somewhere.
Comment 13 Bartosz Dziewoński 2013-03-09 18:56:42 UTC
(In reply to comment #12)
> - will it be harder to change the rules on the fly, if they turn out to
> be imperfect? I understand changing the collation is difficult because
> one has to reindex the whole table, but I suppose changing the first
> letters would be simpler.

Real changes to the collation will require running the update script again,
which might take a couple of hours for hu.wiki (according to Reedy's
testing, it took about 20 hours for the 3.2 million pages on pl.wikipedia).
Category sorting might be slightly borked during this time, and all category
pages will have to be purged afterwards (action=purge or just wait till the
caches expire).

Changing the first letters later won't break the collation, since it's
entirely handled by an external library (ICU); it'll require a purge to
appear on-wiki, though.


> - by the way, should we also check the collation itself? I have mostly
> collected input on the first letter grouping until now.

Please do, but I'm pretty much certain it's correct; it's handled by the ICU
library, which is a battle-tested and mature piece of software.


> - will it be possible to create custom groups? (e.g. someone suggested
> using a "Numbers" group, having separate groups for all digits looks a
> bit silly)

This isn't supported right now, but at first glance it looks possible; it would
likely depend on whether creating the group would require different sorting
order. However, IMO this particular change should be done for all projects
at once, if desired, and should wait for the natural number sorting to be
implemented first (bug 6948) and for multiple collation support (bug 44667;
the chinese-collation branch includes this).


> - what is the logic for non-Hungarian characters? Accented Latin
> characters seem to be ordered as if the accents were stripped, which is
> good, but it would be nice to see the rules spelled out somewhere.

Yes, that's exactly what happens, and similarly for accented variants of
letters in other alphabets; I thought I mentioned that somewhere, apologies.
The default sorting rules are the ones [[Unicode Collation Algorithm]] uses;
they are appropriately tailored for each language-specific collation.

The default "first-letters" list includes the full basic Latin, Greek and
Cyrillic alphabets and, I think, all printable ASCII characters, as well as a
lot of letters from other alphabets and a whole lot of Unicode symbols. It
is generated by MediaWiki based on the data about which letters have
primary-level weight in UCA, but I'm not sure what the exact behavior is;
you can see the generation script at
/maintenance/language/generateCollationData.php in the mediawiki/core
repository, and the pregenerated list at /serialized/first-letters-root.ser.
I doubt that's relevant, though. :)
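
As a very rough approximation of the "accents are stripped" behavior described above, the following Python sketch removes combining marks before comparing titles. This is an assumption-laden stand-in, not the Unicode Collation Algorithm: it ignores language tailorings (so unlike uca-hu it would also fold Ö and Ü), and it leaves letters that have no canonical decomposition (such as "ł") untouched. The function names strip_accents and rough_primary_key are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rough_primary_key(text: str) -> str:
    """Crude stand-in for a primary-level sort key: no accents, no case."""
    return strip_accents(text).casefold()
```

Sorting with key=rough_primary_key places "Čapek" before "Dvořák" and "Zola", whereas a plain code point sort would put "Čapek" last.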
Comment 14 Tim Starling 2013-03-11 05:17:31 UTC
The upgrade to ICU 4.8 should be done before any more wikis start using uca-* collations.
Comment 15 Bartosz Dziewoński 2013-03-18 12:19:34 UTC
The upgrade is done now. Submitted config change proposal as I0cfa3859.
Comment 16 Sam Reed (reedy) 2013-03-27 15:34:47 UTC
Done
