Last modified: 2013-12-19 20:55:13 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T57630, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 55630 - When using UCA collations, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹) sorted under Western Arabic digits' (0 1 2 3 4 5 6 7 8 9) headings
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Categories
Version: 1.22.0
Hardware: All All
Importance: Normal normal with 3 votes
Target Milestone: 1.23.0 release
Assigned To: Bartosz Dziewoński
Keywords: code-update-regression
Depends on:
Blocks:
Reported: 2013-10-11 16:47 UTC by Bartosz Dziewoński
Modified: 2013-12-19 20:55 UTC (History)
CC List: 9 users

Description Bartosz Dziewoński 2013-10-11 16:47:28 UTC
Quoting bug 55565 comment #11:
https://fa.wikipedia.org/wiki/%D8%B1%D8%AF%D9%87:%D8%B5%D9%88%D8%B1%D8%AA_%D9%81%D9%84%DA%A9%DB%8C_%D8%A8%D8%B1%D9%87
It seems that every digit at the start of a page title is being converted to
Arabic digits. We shouldn't see '1' '2' '3' (Arabic digits); we should see '۱' '۲'
'۳' (Persian digits) instead. This is reproducible on all categories, and also on
ckbwiki, which uses Arabic-Indic digits:
https://ckb.wikipedia.org/w/index.php?title=%D9%BE%DB%86%D9%84:%DA%95%DB%86%DA%98%DB%95%DA%A9%D8%A7%D9%86%DB%8C_%D8%B3%D8%A7%DA%B5&action=edit&redlink=1

----

This might also affect other numeral systems; I didn't test.
Comment 1 [no longer active user] 2013-10-11 16:50:06 UTC
Works on Bengali (https://bn.wikipedia.org/wiki/%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80:%E0%A6%AC%E0%A6%9B%E0%A6%B0), but fails on Arabic-Indic (Eastern Arabic) and Persian.
Comment 2 Bawolff (Brian Wolff) 2013-10-11 16:51:48 UTC
If they have the same primary weight, we could just remove the Latin digits from the first letters and add the Farsi ones on a per-language basis.
Comment 3 Bartosz Dziewoński 2013-10-11 16:58:26 UTC
Yeah, that'll probably work, but I'm wondering why it started happening now, after a supposedly minor package upgrade.
Comment 4 Bartosz Dziewoński 2013-10-12 22:01:33 UTC
allkeys.txt entries for '1' and '۱':

0031  ; [.159A.0020.0002.0031] # DIGIT ONE
06F1  ; [.159A.0020.0002.06F1][.0000.0166.0002.06F1] # EXTENDED ARABIC-INDIC DIGIT ONE

Same primary weight.
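
The comparison can be checked mechanically. The following is a minimal Python sketch (not MediaWiki code) that parses only the two allkeys.txt lines quoted above and compares their non-zero primary weights; the regular expression and the helper name `primary_weights` are illustrative assumptions, not part of any existing tool.

```python
import re

# The two allkeys.txt lines quoted above, copied verbatim.
ALLKEYS_SAMPLE = """\
0031  ; [.159A.0020.0002.0031] # DIGIT ONE
06F1  ; [.159A.0020.0002.06F1][.0000.0166.0002.06F1] # EXTENDED ARABIC-INDIC DIGIT ONE
"""

# One collation element looks like [.PPPP.SSSS.TTTT.CCCC];
# the first field (PPPP) is the primary weight.
ELEMENT_RE = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})(?:\.[0-9A-F]{4,6})?\]"
)


def primary_weights(line):
    """Return (code point, non-zero primary weights) for one allkeys.txt line."""
    code_part, rest = line.split(";", 1)
    elements = rest.split("#", 1)[0]  # drop the trailing comment
    primaries = [int(m.group(1), 16) for m in ELEMENT_RE.finditer(elements)]
    # A primary weight of 0000 is ignorable at the primary level, so skip it.
    return int(code_part.strip(), 16), [p for p in primaries if p != 0]


weights = dict(
    primary_weights(line) for line in ALLKEYS_SAMPLE.splitlines() if line.strip()
)
same_primary = weights[0x0031] == weights[0x06F1]
print(same_primary)
```

Both characters end up with the single primary weight 159A, which is why they land under the same heading.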


Trying to list each digit for each language IMO makes little sense (grepping the allkeys.txt file for "DIGIT ONE" yields 60 results).

I think we could use Language#formatNum() for each digit instead and replace Latin ones with localized ones in IcuCollation#getFirstLetterData (per Brian's suggestion), after applying $tailoringFirstLetters.
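
As a rough illustration of that idea, here is a Python sketch (the real change would live in PHP, in IcuCollation#getFirstLetterData). The digit table and the function `localise_first_letters` are hypothetical stand-ins for what Language#formatNum() would provide per digit.

```python
# Hypothetical digit table for Persian: U+06F0..U+06F9 replace "0".."9".
PERSIAN_DIGITS = {str(d): chr(0x06F0 + d) for d in range(10)}


def localise_first_letters(first_letters, digit_table):
    """Swap Latin digit headings for localised ones; leave other letters alone."""
    return [digit_table.get(letter, letter) for letter in first_letters]


# A made-up list of first-letter headings, as it might look after tailoring.
headings = ["0", "1", "2", "A", "\u0627", "\u0628"]
localised = localise_first_letters(headings, PERSIAN_DIGITS)
print(localised)
```

Because the Latin and localised digits share a primary weight, swapping only the heading labels should leave the sort order itself unchanged.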
Comment 5 Calak 2013-10-12 22:16:31 UTC
Be aware that we use different Unicode code points for digits on ckb.wiki:
٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
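
The difference between the two digit sets is easy to see with Python's standard `unicodedata` module. This sketch builds both ranges from their code points (U+0660..U+0669 for the Arabic-Indic digits shown above, U+06F0..U+06F9 for the Persian digits in the bug title); the helper name `describe` is an illustrative assumption.

```python
import unicodedata

# Arabic-Indic digits (used on ckb.wiki) and Extended Arabic-Indic (Persian) digits.
ckb_digits = "".join(chr(0x0660 + d) for d in range(10))
fa_digits = "".join(chr(0x06F0 + d) for d in range(10))


def describe(digits):
    """Return (code point, Unicode name, numeric value) for each digit character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.digit(c))
        for c in digits
    ]


for row in describe(ckb_digits[:2]) + describe(fa_digits[:2]):
    print(row)
```

Both sets carry the same numeric values, but they are distinct characters, so a digit table built for 'fa' produces the wrong glyphs for ckb.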
Comment 6 Gerrit Notification Bot 2013-10-12 22:29:26 UTC
Change 89488 had a related patch set uploaded by Bartosz Dziewoński:
IcuCollation: Sort digits under localised digits' headings

https://gerrit.wikimedia.org/r/89488
Comment 7 Bartosz Dziewoński 2013-10-12 22:31:45 UTC
Bah, and of course ckb.wp has to use the 'uca-fa' collation, because otherwise it would be too easy to fix.

My patch above doesn't handle this case, because I don't see how we could do it without creating a faux collation for ckb and if()-ing it (which would be ugly), or using wiki language instead of collation language (which would be unexpected). Input welcome.
Comment 8 Bawolff (Brian Wolff) 2013-10-13 00:06:55 UTC
I guess we could apply the digit transformation on rendering a numeric section header in the category page, instead of in the collation. Not sure if that's really a good idea though.
Comment 9 Tim Starling 2013-10-28 05:50:24 UTC
There could be a "ckb" collation (i.e. not uca-ckb), class name CollationCkb which is a subclass of IcuCollation. You could have IcuCollation::getDigitTransformTable() which is overridden by the subclass. CollationCkb::__construct() would call parent::__construct('fa').

Doing it that way means that when ICU adds support for ckb, migration from ckb to uca-ckb can be done without breaking the wiki.
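
The shape of that design might look like the following Python sketch. It is a toy model, not MediaWiki code: the class names mirror the ones suggested above, but the `DIGITS` table, `_table_for`, and `first_letter_heading` are hypothetical, and no real ICU collator is involved.

```python
class IcuCollation:
    """Toy stand-in for the collation class discussed above."""

    # Hypothetical per-language digit strings, indexed by digit value.
    DIGITS = {
        "fa": "".join(chr(0x06F0 + d) for d in range(10)),
        "ckb": "".join(chr(0x0660 + d) for d in range(10)),
    }

    def __init__(self, locale):
        self.locale = locale  # locale that would be handed to ICU

    def _table_for(self, lang):
        # Languages without an entry keep the Latin digits.
        return dict(zip("0123456789", self.DIGITS.get(lang, "0123456789")))

    def get_digit_transform_table(self):
        return self._table_for(self.locale)

    def first_letter_heading(self, letter):
        return self.get_digit_transform_table().get(letter, letter)


class CollationCkb(IcuCollation):
    """Sorts with the 'fa' rules but shows ckb digits in headings."""

    def __init__(self):
        super().__init__("fa")

    def get_digit_transform_table(self):
        return self._table_for("ckb")


print(IcuCollation("fa").first_letter_heading("1"))
print(CollationCkb().first_letter_heading("1"))
```

Only the digit table is overridden, so the sorting behaviour stays identical to the 'fa' collation.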

Or if the problem is likely to be repeated with other languages, there could be some regex-based alias feature in Collation::factory(), e.g. "alias-ckb/fa", where the collation name would specify both the ICU locale and the MW locale.
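
A regex-based alias of that kind could be parsed roughly as below. This is a Python sketch under an explicit assumption: in "alias-ckb/fa" the first code is taken to be the MW locale and the second the ICU locale, which the comment above does not spell out. The pattern and the function `parse_collation_name` are hypothetical.

```python
import re

# Assumed layout: alias-<MW locale>/<ICU locale>
ALIAS_RE = re.compile(r"^alias-(?P<mw>[a-z-]+)/(?P<icu>[a-z-]+)$")


def parse_collation_name(name):
    """Split an alias collation name into its two locales, or return None."""
    match = ALIAS_RE.match(name)
    if match is None:
        return None
    return {"mw_locale": match.group("mw"), "icu_locale": match.group("icu")}


print(parse_collation_name("alias-ckb/fa"))
print(parse_collation_name("uca-fa"))
```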
Comment 10 Gerrit Notification Bot 2013-11-17 15:06:10 UTC
Change 95867 had a related patch set uploaded by Bartosz Dziewoński:
IcuCollation: Add CollationCkb subclass for Sorani Kurdish

https://gerrit.wikimedia.org/r/95867
Comment 11 Bartosz Dziewoński 2013-11-17 15:07:54 UTC
(In reply to comment #9)
> There could be a "ckb" collation (i.e. not uca-ckb), class name CollationCkb
> which is a subclass of IcuCollation. You could have
> IcuCollation::getDigitTransformTable() which is overridden by the subclass.
> CollationCkb::__construct() would call parent::__construct('fa').

I implemented this in the patch above (which depends on the previous patch, https://gerrit.wikimedia.org/r/89488).


> Or if the problem is likely to be repeated with other languages, there could
> be
> some regex-based alias feature in Collation::factory(), e.g. "alias-ckb/fa",
> where the collation name would specify both the ICU locale and the MW locale.

I did not implement this. Hopefully it will never be needed, because it sounds bad. :) But if we ever need it, it won't be hard to migrate.
Comment 13 Bartosz Dziewoński 2013-11-22 16:22:50 UTC
We're still working on it :) Both of my patches are waiting to be re-reviewed.
Comment 14 Gerrit Notification Bot 2013-12-12 04:45:15 UTC
Change 89488 merged by jenkins-bot:
IcuCollation: Sort digits under localised digits' headings

https://gerrit.wikimedia.org/r/89488
Comment 15 Gerrit Notification Bot 2013-12-12 04:49:49 UTC
Change 95867 merged by jenkins-bot:
IcuCollation: Add CollationCkb subclass for Sorani Kurdish

https://gerrit.wikimedia.org/r/95867
Comment 16 Gerrit Notification Bot 2013-12-12 15:55:21 UTC
Change 101005 had a related patch set uploaded by Bartosz Dziewoński:
(bug 55630) $wgCategoryCollation = 'xx-uca-ckb' for ckbwiki

https://gerrit.wikimedia.org/r/101005
Comment 17 Bartosz Dziewoński 2013-12-12 16:01:45 UTC
Status update: Tim merged the two patches. Thanks!

* This means that category headings on fa.wikipedia and other wikis
  using languages with localised digits will start behaving correctly
  as soon as they are deployed, which will happen on 19 December
  (according to [[mw:MediaWiki_1.23/Roadmap]]).
* ckb.wikipedia is troublesome because it's currently using a
  collation meant for 'fa'; my configuration patch above fixes that as
  well. (If it were not deployed, 'fa' digits would be used instead of
  'ckb' digits.)

I'll leave this open for a while longer until everything is sorted out.
Comment 18 Gerrit Notification Bot 2013-12-19 19:03:41 UTC
Change 101005 merged by jenkins-bot:
(bug 55630) $wgCategoryCollation = 'xx-uca-ckb' for ckbwiki

https://gerrit.wikimedia.org/r/101005
Comment 19 Bartosz Dziewoński 2013-12-19 20:46:39 UTC
Looking at links from comment 0, everything seems to be in order now. Thanks for the help and reports, everyone!
Comment 20 Calak 2013-12-19 20:55:13 UTC
Thank you very much Bartosz Dziewoński.
