Last modified: 2013-07-25 17:14:22 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T32287, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 30287 - Implement uca-fa collation
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 30673 50311
Reported: 2011-08-09 12:35 UTC by reza1615
Modified: 2013-07-25 17:14 UTC
CC List: 10 users


Description reza1615 2011-08-09 12:35:33 UTC
Hi,
Character sorting on fa wiki and other projects is not correct. The order should be:
آ-ا-ب-پ-ت-ث-ج-چ-ح-خ-د-ذ-ر-ز-ژ-س-ش-ص-ض-ط-ظ-ع-غ-ف-ق-ک-گ-ل-م-ن-ه-و-ی
Would you please correct it?
Comment 1 Roan Kattouw 2011-08-10 16:09:24 UTC
Do these sorting problems appear on category pages or somewhere else?
Comment 2 reza1615 2011-08-10 19:57:33 UTC
(In reply to comment #1)
> Do these sorting problems appear on category pages or somewhere else?
Yes, in:
1. Special pages: all of the special page reports
2. pagegenerator.py for bots
3. Categories
4. All pages that contain a Wikimedia-generated list
Comment 7 Bawolff (Brian Wolff) 2011-08-17 18:42:43 UTC
re comment 6:
> http://ehsanakhgari.org/article/php/persian-sorting-mysql
Sorting things on the PHP side, as that article suggests, is also probably not going to happen.


----

It looks like the letters that appear in fa but not in ar have code points a bit higher than the letters that are in ar (thus binary sorting by code point gives bad results).

In my testing, using the uca-default collation instead of the standard uppercase collation should fix this. (I only tested that پ (U+067E) and ت (U+062A) are sorted correctly relative to each other, but since it fixes those two, I'm assuming the others are fixed too. If not, we could probably write a custom collation fairly easily.)

So basically, we need to enable uca-default on Wikimedia to fix this, or at least to fix it on category pages.

Special page reports probably won't be fixed anytime soon, unfortunately (is there another bug for that?). pagegenerator.py is probably just using Special:AllPages, which probably won't be fixed in the near future, but possibly the pywikipedia folks could sort the list on the client side.
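
For illustration, here is a minimal Python sketch of the code point problem described above, together with the kind of client-side sort a bot could apply. It uses only the standard library. The names PERSIAN_ALPHABET, RANK and persian_key are hypothetical and are not taken from MediaWiki or pywikipedia, and the key only handles single letters of the alphabet given in the description, so it is nowhere near a full UCA collation.

```python
# Persian alphabet in the order requested in the bug description,
# written as escapes so the exact code points are unambiguous.
PERSIAN_ALPHABET = (
    "\u0622\u0627\u0628\u067E\u062A\u062B\u062C\u0686"
    "\u062D\u062E\u062F\u0630\u0631\u0632\u0698\u0633"
    "\u0634\u0635\u0636\u0637\u0638\u0639\u063A\u0641"
    "\u0642\u06A9\u06AF\u0644\u0645\u0646\u0647\u0648\u06CC"
)

# Rank of each letter in the expected order (illustrative helper).
RANK = {ch: i for i, ch in enumerate(PERSIAN_ALPHABET)}


def persian_key(title):
    """Sort key: alphabet letters by rank, anything else afterwards by code point."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in title]


# What a binary (code point) sort does to the alphabet.
by_code_point = sorted(PERSIAN_ALPHABET)
# What a client-side sort with the key above does.
by_alphabet = sorted(PERSIAN_ALPHABET, key=persian_key)

if __name__ == "__main__":
    print("code point order:", " ".join(by_code_point))
    print("alphabet order:  ", " ".join(by_alphabet))
    # Six letters of the Persian alphabet end up after all the others.
    print("pushed to the end:", [f"U+{ord(c):04X}" for c in by_code_point[-6:]])
```

Under code point ordering, the six letters U+067E, U+0686, U+0698, U+06A9, U+06AF and U+06CC land after every other letter, which matches the symptom reported here.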
Comment 8 reza1615 2011-08-30 14:01:50 UTC
When will the uca-default collation be used instead of the standard uppercase collation?
Comment 9 Roan Kattouw 2011-08-31 13:20:29 UTC
(In reply to comment #8)
> When will the uca-default collation be used instead of the standard uppercase
> collation?
All of our Apaches have to be upgraded to a newer version of Ubuntu first, so UCA is available. The operations team is still working on that.
Comment 10 Huji 2011-09-18 17:40:08 UTC
What this bug requests is implementing collations for Persian Wikipedia; therefore it is a duplicate of bug 164.

*** This bug has been marked as a duplicate of bug 164 ***
Comment 11 Bawolff (Brian Wolff) 2011-09-18 19:48:24 UTC
Weren't we trying to split bug 164 into multiple tracking bugs? At any rate, more work needs to be done for this, so I think it makes sense to keep this open as a dependency of bug 30673.

Re-opening on that basis.
Comment 12 Philippe Verdy 2011-09-18 21:04:12 UTC
The collation table for Persian (a.k.a. "Farsi") is given in the MimerSQL documentation for developers at
http://developer.mimer.com/charts/persian.htm
which defines it with this rule:

CREATE COLLATION persian FROM eor USING
'[Arabic]'
'&#064E#<<#0650#<<#064F#<<#064B#<<#064D#<<#064C#'
'&#0621#<#0622#'
'&#0627#<<#0671#<#0621#<<#0623#<<#0672#<<#0625#'
'       <<#0673#<<#0624#<<#06CC##0654#<<<#0649##0654#<<<#0626#'
'&#06A9#<<#06AA#<<#06AB#<<#0643#<<#06AC#<<#06AD#<<#06AE#'
'&#06CF#<#0647#<<#06D5#<<#06C1#<<#0629#<<#06C3#<<#06C0#<<#06BE#'
'&#06CC#<<#0649#<<#06D2#<<#064A#<<#06D0#<<#06D1#<<#06CD#<<#06CE#'

where "eor" is the base collation for the standard "European Ordering Rules" (defined as both an ISO standard and a CEN standard), on which most other collation orders are based, with very small tailorings. A few other settings require specific adjustments, indicated here by the "[Arabic]" tailoring attribute, which reorders all Arabic blocks before the letters of all other scripts (but still after the ignorables, whitespace, variables, common length marks, common currency symbols, and common digits). The rule above then adds a specific reordering of a few other letters (see the collation chart).

Yes, this is different from the standard collation for the Arabic language, which is a bit simpler (and only adjusts secondary differences):

CREATE COLLATION arabic FROM eor USING
'[Arabic]'
'&#0627#<<#0622#<<#0627#<<#0621#<<#0623#<<#0625#<<#0624#<<#0626#'
'&#064A#<<#0649#'

and it is also different from the Urdu collation which is a bit more complex:

CREATE COLLATION urdu FROM eor USING
'[Arabic]'
'&#064B#<<#0652#<<#064E#<<#0650#<<#064F#<<#0670#<<#0656#<<#0657#'
'       <<#064B#<<#064D#<<#064C#<<#0654#<<#0651#<<#0658#<<#0653#'
'&#0627#<<#0623#<#0622#'
'&#0648#<<#0624#'
'&#06CF#<#06C1#<<#0647#<#06BE#<#06C3#<<#0629#<#0621#'
'&#06CC#<<#0649#<<#064A#<<#0626#'
'&#0628#<#0628##06BE#'
'&#067E#<#067E##06BE#'
'&#062A#<#062A##06BE#'
'&#0679#<#0679##06BE#'
'&#062C#<#062C##06BE#'
'&#0686#<#0686##06BE#'
'&#062F#<#062F##06BE#'
'&#0688#<#0688##06BE#'
'&#0631#<#0631##06BE#'
'&#0691#<#0691##06BE#'
'&#06A9#<#06A9##06BE#'
'&#06AF#<#06AF##06BE#'
'&#0644#<#0644##06BE#'
'&#0645#<#0645##06BE#'
'&#0646#<#0646##06BE#'
'&#06BA#<#06BA##06BE#'
'&#0648#<#0648##06BE#'
'&#06CC#<#06CC##06BE#';

MimerSQL has defined these rules using EOR as the base collation. The CLDR project was initially based on the DUCET collation, but now uses a different base collation (a modified DUCET), which is closer to the standard EOR (but still different).

Note that MimerSQL, like MySQL, the default Java runtime library, and the .NET CLR library, still does not support the newer syntax for contextual rules and for reordering script blocks, which for now is supported only by the most recent version of ICU; it also lacks support for the newer attributes.

The DUCET will soon be changed to become closer to the CLDR version made for ICU, but the modified DUCET in the CLDR also does not use any contextual rules (for compatibility with lots of other implementations of the UCA). For this reason, some scripts will still not sort as expected using only the CLDR rules, without the extended syntax (for example the Devanagari script; see the final vowelless consonant clusters at the end of syllables).

This is even more critical for Lao, which requires a very complex syllabification that cannot be represented by a collation table, but only by a specific [Lao] attribute that triggers script-specific syllabification in code and sometimes dictionary lookups. The same case occurs with the collations for the Thai and Khmer languages, but in a less critical way.

So don't assume that any single DUCET (or modified DUCET from CLDR, or even the EOR collation table) will make things correct for all languages. We still need tailorings on top of any base collation, for almost all languages in all scripts!
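
To make the difference between primary ("<") and secondary ("<<") tailoring levels concrete, here is a rough Python sketch based on two lines of the Persian rule quoted above (the ones anchored at U+06A9 and U+06CC). It is only an approximation under stated assumptions: the primary weight is simply the code point of the base letter, which is a placeholder and not the real Persian order, and SECONDARY_VARIANTS and two_level_key are hypothetical names, not part of ICU, MimerSQL or MediaWiki.

```python
# Characters that the quoted rule places at a secondary ("<<") distance
# from a base letter, mapped to (base letter, secondary weight):
#   &#06A9#<<#06AA#<<#06AB#<<#0643#...
#   &#06CC#<<#0649#<<#06D2#<<#064A#...
SECONDARY_VARIANTS = {
    "\u06AA": ("\u06A9", 1),
    "\u06AB": ("\u06A9", 2),
    "\u0643": ("\u06A9", 3),
    "\u0649": ("\u06CC", 1),
    "\u06D2": ("\u06CC", 2),
    "\u064A": ("\u06CC", 3),
}


def two_level_key(word):
    """Build a (primary, secondary) key; secondary only breaks primary ties."""
    primary = []
    secondary = []
    for ch in word:
        base, weight = SECONDARY_VARIANTS.get(ch, (ch, 0))
        # Placeholder primary weight: the code point of the base letter.
        primary.append(ord(base))
        secondary.append(weight)
    return (primary, secondary)


# Three words that differ in which kaf they use and in their last letter.
word_keheh_reh = "\u06A9\u0627\u0631"    # U+06A9 (Persian kaf) + alef + reh
word_kaf_sheen = "\u0643\u0627\u0634"    # U+0643 (Arabic kaf) + alef + sheen
word_keheh_sheen = "\u06A9\u0627\u0634"  # U+06A9 (Persian kaf) + alef + sheen

words = [word_kaf_sheen, word_keheh_reh, word_keheh_sheen]
by_code_point = sorted(words)
by_two_levels = sorted(words, key=two_level_key)

if __name__ == "__main__":
    print("code point:", by_code_point)
    print("two levels:", by_two_levels)
```

With raw code points, the word spelled with U+0643 sorts before both words spelled with U+06A9. With the two-level key, the two kaf forms share a primary weight, so the words are ordered by their remaining letters first and the kaf variant only decides the tie.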
Comment 13 Huji 2011-09-18 22:03:23 UTC
Having the [Arabic] block listed before the Persian-specific letters is the reason letters will not be sorted correctly according to the Persian alphabet.

For Latin languages, this has been solved by introducing more than one collation in the latin1 family (i.e. latin1_german_ci, latin1_swedish_ci, ...). Using utf8_general_ci is also not an option: it works for Arabic, but not for Persian (for the reason mentioned above). The Persian community is also underrepresented in many online collaborations, so I think it is very unlikely that MySQL or other responsible authorities will introduce a new collation (something along the lines of utf8_persian_ci) just for that purpose.

In light of the above explanation, what is a pragmatic solution to this problem?
Comment 14 Huji 2011-09-18 22:06:41 UTC
Of note: http://bugs.mysql.com/bug.php?id=29977

More than four years old.
Comment 15 Bawolff (Brian Wolff) 2011-09-18 22:21:29 UTC
Unless something has changed, we're not planning to use MySQL's collation support, so this is irrelevant.
Comment 16 Philippe Verdy 2011-09-19 02:36:15 UTC
I've never said or suggested that! This is perfectly relevant for the implementation of tailorings. This is also relevant because there is documentation available for the collation needed for Persian, and because it is not the same as standard Arabic, or Urdu, as demonstrated...
Comment 17 Philippe Verdy 2011-09-19 02:43:41 UTC
Also, I did not use "MySQL" as the base documentation, but "MimerSQL", which does not have the bugs you have cited for MySQL (and the archived mails in those bugs are mostly about the primary level; what I cited was about the secondary level as well, which was forgotten in the discussions you cite, dating from 2007). MimerSQL apparently does not have these bugs, and that's why I cited it as a reference, but this does not mean that we need to use it for our code.
Comment 18 Bawolff (Brian Wolff) 2011-09-19 03:38:16 UTC
(In reply to comment #16)
> I've never said or suggested that! This is perfectly relevant for the
> implementation of tailorings. This is also relevant because there is
> documentation available for the collation needed for Persian, and
> because it is not the same as standard Arabic, or Urdu, as demonstrated...

I was referring more to Huji's comment about (what I took to be) MySQL collations. To be honest, at the time I made that comment I had only briefly skimmed what you (Philippe) wrote.

However, with that said: since we plan to use PHP's intl extension, which is just a wrapper around the ICU library, which from my understanding already implements Persian tailorings (and from limited testing certainly seems to), a discussion about how to implement Persian tailorings isn't that relevant either.

All the hard stuff about this bug is essentially done (mostly by other libraries). Basically, what's left is some loose ends related to being able to select which locale to use.
Comment 19 Philippe Verdy 2011-09-19 04:05:13 UTC
It's still interesting to know which version of ICU (and of its implemented CLDR data) is used in PHP's "intl" extension, or how it plans to support the expected change which will very likely occur soon.

(It is already being discussed on the internal Unicode mailing list, aka "unicore", for Unicode members, and on the CLDR mailing list, with ICU authors leading the CLDR discussion; the authors of PHP "intl" seem to be absent from it, following only what is found in the CLDR releases! It is also being discussed in the associated ISO working group maintaining the international collation standard, which is referenced by both the Unicode UCA technical standard and by the CLDR project in the LDML specifications and in the design of tailoring rules.)

More changes will appear soon in the next Unicode and CLDR versions (notably, the DUCET will be significantly modified in the UTS to become closer to what is used in CLDR, and there should be changes to natively support the EOR collation supported by ISO and CEN standards).
Comment 20 Bawolff (Brian Wolff) 2011-09-19 04:54:45 UTC
(In reply to comment #19)
> It's still interesting to know which version of ICU (and of its implemented
> CLDR data) is used in PHP's "intl" extension...
> 

I believe that depends on what version of ICU was available when intl was compiled. On my system it's using 4.4.2 (according to phpinfo()), which I believe corresponds to CLDR 1.8. I imagine other people would have it compiled with a different ICU version.
Comment 21 Philippe Verdy 2011-09-19 05:30:37 UTC
So this does not match the current 2.0.1 (2011-07-18) update of CLDR, and also not the major 2.0 release (2011-05-25).

Version 1.8 is dated 2010-03-17, and still does not match Unicode 6, the current version of LDML, the modified version of the DUCET for the CLDR "root" locale, newer contextual collation tailoring rules, and the newer reordering of full scripts for specific languages that can be written in multiple scripts (e.g. Serbian, Japanese, Chinese and several of its dialects, many South or Central Asian languages). Version 1.8 also still does not work correctly for Khmer and Lao scripts, and even includes issues with Hangul (Korean).

Version tracking of PHP's "intl" extension and ICU is then needed (in addition to the PHP version, if it creates a dependency). You must be more specific than just speaking about "intl" being used in MediaWiki. By contrast, ICU remains in stricter sync with versions of Unicode UTS#10 (UCA), LDML, and CLDR data.

It's also important to track which part of the CLDR has been integrated when compiling ICU for the PHP "intl" extension, and which specific tailoring data have been built into that ICU module (or as external datafiles).

And as far as I know, ICU still does not natively implement the EOR collation (as defined equivalently in ISO and CEN standards); it also has some experimental code for future proposed or pending updates to these collation standards (including a refined, contextual, definition of static "collation levels", in order to later deprecate some of the too many existing "attributes" which often lack a stricter formal definition for interoperability).
Comment 22 Bawolff (Brian Wolff) 2011-09-19 05:36:01 UTC
>So this does not match the current 2.0.1 (2011-07-18) update of CLDR, and also
>not the major 2.0 release (2011-05-25).

Well I installed via apt-get, which I'm sure is a little dated. If you installed via some other means, it'd probably be more up to date.


> Version 1.8 is dated 2010-03-17, and still does not match Unicode 6, the
> current version of LDML, the modified version of the DUCET for the CLDR "root"
> locale, newer contextual collation tailoring rules, and the newer reordering of
[..]
> 
> Version tracking of PHP's "intl" extension and ICU is then needed (in addition
> to the PHP version, if it creates a dependency). You must be more specific than
> just speaking about "intl" being used in MediaWiki. By contrast, ICU remains
> in stricter sync with versions of Unicode UTS#10 (UCA), LDML, and CLDR data.
> 
> It's also important to track which part of the CLDR has been integrated when
> compiling ICU for the PHP "intl" extension, and which specific tailoring data
> have been built into that ICU module (or as external datafiles).
> 
> And as far as I know, ICU still does not natively implement the EOR collation
> (as defined equivalently in ISO and CEN standards); it also has some
> experimental code for future proposed or pending updates to these collation
> standards (including a refined, contextual, definition of static "collation
> levels", in order to later deprecate some of the too many existing "attributes"
> which often lack a stricter formal definition for interoperability).


Why? How does this affect us (beyond the obvious: people using older versions get crappier collation support)?
Comment 23 Philippe Verdy 2011-09-19 23:30:57 UTC
(In reply to comment #22)
> Why? How does this affect us (beyond the obvious: people using older versions
> get crappier collation support)?

Look at the many changes documented on the CLDR site; each version has a log listing these changes in the bug tracker, as well as a summary report.

Yes, since CLDR 1.8 (based on the Unicode 5.0 subset of the UCS, plus only 4 additional characters that were standardized soon after in a minor update of Unicode 5, and in sync with ISO 14651:2007), there have been significant changes that affect Persian sorting (as well as Urdu) for cases specific to languages other than Arabic that are written in the Arabic script, as well as changes to the Bidi algorithm (before the Bidi classes were frozen).

Given that the major release 6.0 of Unicode has been out for months now (as well as the 2011 release of ISO 10646, now in its second generation), that the Unicode DUCET was released at the same time, and that the CLDR project also integrated it before proposing a new extension format for easier and stable tailorings, the ISO 14651 standard should be updated soon (there are still discussions about a few cases, notably for Lao, Hindi, and variable elements).

Then look at the ICU version log which also has its own buglist and tracker.
Comment 24 reza1615 2013-01-19 10:52:51 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > When will the uca-default collation be used instead of the standard uppercase
> > collation?
> All of our Apaches have to be upgraded to a newer version of Ubuntu first, so
> UCA is available. The operations team is still working on that.

Is the uca-default collation installed?
Comment 25 Bawolff (Brian Wolff) 2013-01-19 15:08:03 UTC
> Is the uca-default collation installed?
Yes. It is currently enabled at pt.wikipedia.org

However, it may sort some characters incorrectly until we support the tailored collations. Specifically, everything that's coloured blue on http://collation-charts.org/icu442/icu442-fa.html will probably sort incorrectly.
Comment 26 reza1615 2013-01-19 15:22:33 UTC
The bug affects these letters (listed with their UTF-8 byte values in hex):
گ  DAAF
ک  DAA9
ژ  DA98
پ D9BE
ی DB8C
چ DA86
They are now shown at the end. These characters are not used in the Arabic language, and because of that they are sorted incorrectly, for example in
http://fa.wikipedia.org/wiki/%D8%B1%D8%AF%D9%87:%DA%A9%D8%B4%D9%88%D8%B1%D9%87%D8%A7%DB%8C_%D8%A2%D8%B3%DB%8C%D8%A7%DB%8C%DB%8C


They should be ordered as below, where آ comes first:

آ-ا-ب-پ-ت-ث-ج-چ-ح-خ-د-ذ-ر-ز-ژ-س-ش-ص-ض-ط-ظ-ع-غ-ف-ق-ک-گ-ل-م-ن-ه-و-ی
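
As a small cross-check of the hex values listed in this comment, the following Python sketch (standard library only; LETTERS and utf8_hex are illustrative names) prints the Unicode code point and the UTF-8 byte sequence of each of the six letters. It suggests that the values quoted above are UTF-8 bytes rather than code points.

```python
# The six letters named in this comment, as explicit code points.
LETTERS = ["\u06AF", "\u06A9", "\u0698", "\u067E", "\u06CC", "\u0686"]


def utf8_hex(ch):
    """Return the UTF-8 encoding of a character as uppercase hex digits."""
    return ch.encode("utf-8").hex().upper()


# (code point, UTF-8 bytes) for every letter, in the order listed above.
TABLE = [(f"U+{ord(ch):04X}", utf8_hex(ch)) for ch in LETTERS]

if __name__ == "__main__":
    for code_point, encoded in TABLE:
        print(code_point, encoded)
```

All six code points lie above U+064A, the last basic Arabic letter, which is why plain code point ordering places them after the rest of the alphabet.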
Comment 27 reza1615 2013-01-19 15:26:46 UTC
Also, the blue rectangles are Urdu, not Farsi, except for ه
Comment 28 Bartosz Dziewoński 2013-02-26 20:37:32 UTC
(This doesn't really depend on bug 30996, removing the dependency. If uca-default is suitable, it should block it. If it's not, support should be implemented based on I838484b9 and this should be marked as blocking bug 45443.)
Comment 29 Bawolff (Brian Wolff) 2013-04-16 17:14:54 UTC
This should be do-able now.
Comment 30 Sam Reed (reedy) 2013-05-16 20:14:20 UTC
(In reply to comment #29)
> This should be do-able now.

Do-able in what sense? Code-able or deploy-able?
Comment 31 Bawolff (Brian Wolff) 2013-05-17 01:10:56 UTC
Oh, I thought it was deployable, but it looks like there's still some code to do. (There is definitely support in the ICU library. It's not in the array in Collation.php.)
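
The gap described here, where the library supports a locale but the application does not list it, can be pictured with a small Python sketch. This is purely hypothetical and is not the code in Collation.php: TAILORED_LOCALES and resolve_collation are invented names, and the assumption is that a collation name of the form "uca-<code>" is accepted only when <code> appears in an allowlist.

```python
import re

# Hypothetical allowlist standing in for "the array in Collation.php".
# Only "fa" is listed here because it is the locale this bug asks for.
TAILORED_LOCALES = {"fa"}


def resolve_collation(name):
    """Map a collation name such as "uca-fa" to a locale code, or raise."""
    if name == "uca-default":
        # Untailored UCA: assume the root locale.
        return "root"
    match = re.fullmatch(r"uca-([a-z-]+)", name)
    if match and match.group(1) in TAILORED_LOCALES:
        return match.group(1)
    raise ValueError(f"unsupported collation: {name}")


if __name__ == "__main__":
    for candidate in ("uca-default", "uca-fa", "uca-xx"):
        try:
            print(candidate, "->", resolve_collation(candidate))
        except ValueError as error:
            print(candidate, "->", error)
```

In this picture, "adding fa to the list" amounts to a one-line change to the allowlist, while the sorting itself is delegated to the underlying library.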
Comment 32 Gerrit Notification Bot 2013-05-17 02:03:59 UTC
Related URL: https://gerrit.wikimedia.org/r/64251 (Gerrit Change I3c30824f7d133cf615ec7c2c39d31f27c39f89fe)
Comment 33 Gerrit Notification Bot 2013-06-27 18:28:01 UTC
Change 64251 merged by jenkins-bot:
Add fa to collation list.

https://gerrit.wikimedia.org/r/64251
Comment 34 Bartosz Dziewoński 2013-06-27 18:31:27 UTC
Marking this as fixed, as the ability to do this has been implemented in MediaWiki proper.

Bug 50311 is now about deploying this in 'fa' wikis.
Comment 35 reza1615 2013-06-28 10:51:34 UTC
(In reply to comment #33)
> Change 64251 merged by jenkins-bot:
> Add fa to collation list.
> 
> https://gerrit.wikimedia.org/r/64251
>Based on http://collation-charts.org/icu442/icu442-fa.html
>Should be verified by a native speaker.
As a native speaker I confirm http://collation-charts.org/icu442/icu442-fa.html
Comment 36 Huji 2013-06-28 23:19:58 UTC
Second that.
