Last modified: 2014-11-16 03:06:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links other than those to bug reports and their history might be broken. See T75453, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 73453 - Tamil sort order
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody
Depends on:
Blocks: 32578 collations
Reported: 2014-11-15 08:56 UTC by Sundar
Modified: 2014-11-16 03:06 UTC (History)
CC List: 11 users


Attachments
Showing Tamil consonant sequences (29.06 KB, image/jpeg)
2014-11-15 08:56 UTC, Sundar

Description Sundar 2014-11-15 08:56:33 UTC
Created attachment 17139
Showing Tamil consonant sequences

Reference page: https://ta.wikipedia.org/s/4om
On the above page, 'ஜ' follows 'ச'. Characters such as 'ஸ', 'ஷ', 'ஜ' and 'ஹ' are called Grantha characters and are not part of the basic Tamil alphabet (see https://en.wikipedia.org/wiki/Tamil_script#Basic_consonants). By convention they are placed towards the end, i.e. after 'ன'. The first column in the attached image shows the correct sequence. (Image source: Naga. Ilangovan)
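
A minimal Python sketch of the difference, for illustration only. It assumes the current ordering behaves like a plain code-point sort (not confirmed in this report), and the relative order of the Grantha letters among themselves is an assumption. The names BASIC, GRANTHA and RANK are hypothetical.

```python
# Traditional order of the 18 basic Tamil consonants.
BASIC = "க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன".split()
# Grantha letters go after 'ன'; their order among themselves is an assumption.
GRANTHA = "ஜ ஶ ஷ ஸ ஹ".split()

# Map each consonant to its position in the conventional sequence.
RANK = {letter: position for position, letter in enumerate(BASIC + GRANTHA)}

sample = ["ன", "ஜ", "ச", "ஞ"]

# Plain code-point order: 'ஜ' (U+0B9C) lands between 'ச' (U+0B9A) and 'ஞ' (U+0B9E).
by_code_point = sorted(sample)

# Conventional order: 'ஜ' moves to the end, after 'ன'.
by_convention = sorted(sample, key=RANK.__getitem__)

print(by_code_point)
print(by_convention)
```

In Unicode the Grantha letters are interleaved with the basic consonants, so any ordering based on raw code points will show the behaviour reported here.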
Comment 1 Sam Reed (reedy) 2014-11-15 10:28:27 UTC
I think we need to add collation support for Tamil (not sure whether we need upstream libicu changes), and look at getting the category collation updated on tawiki.
Comment 2 Bartosz Dziewoński 2014-11-15 16:04:21 UTC
ICU appears to support Tamil (http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/ta.txt), so we only need to add it to the list of supported collations and perhaps adjust first-letter generation. (And then confirm that it actually sorts the words correctly.)
Comment 3 Sundar 2014-11-15 16:40:22 UTC
Thanks, Sam Reed and Bartosz Dziewoński. Yes, http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/ta.txt is correct for the consonant sequence. We just need to validate the overall sequence of vowels, consonants and compounds.
Comment 4 elan 2014-11-16 02:35:31 UTC
The following two related issues should be added to this bug report.

1) The letter ஃ should sort after all the vowels. Currently it is positioned first: ஃ, அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ. The right order is அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ, ஃ. When all the Tamil letters are sorted, ஃ should appear after ஔ and before க்.

2) Pure consonants should sort before their compound forms. If we sort the letters (ம, ம், மா), the right result is (ம், ம, மா). Currently the order is (ம, மா, ம்), which is wrong. The impact can be seen by sorting a few strings. Given the 4 strings (கணமொழி, கணமூலி, கணம்புல், கணம்), the current sort order yields (கணமூலி, கணமொழி, கணம், கணம்புல்). This is wrong; the right order is (கணம், கணம்புல், கணமூலி, கணமொழி). (These 4 strings are proper Tamil words according to the Tamil lexicon at http://dsalsrv02.uchicago.edu/dictionaries/tamil-lex/ .)
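
The two rules above can be sketched as a Python sort key. This is only an illustration of the requested ordering, not how ICU or MediaWiki implement collation. The function name tamil_sort_key and all constants are hypothetical, the order of the Grantha letters among themselves is an assumption, and placing non-Tamil characters after Tamil ones is an arbitrary choice.

```python
import unicodedata

def _letters(text):
    # NFC keeps two-part letters such as 'ஔ' as a single code point.
    return unicodedata.normalize("NFC", text).split()

VOWELS = _letters("அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ")
AYTHAM = "\u0b83"  # ஃ, sorted after all vowels and before all consonants
CONSONANTS = _letters("க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன")
GRANTHA = _letters("ஜ ஶ ஷ ஸ ஹ")  # after 'ன'; internal order is an assumption
CONSONANT_SET = set(CONSONANTS + GRANTHA)

PULLI = "\u0bcd"  # ்  marks a pure consonant
VOWEL_SIGNS = ["\u0bbe", "\u0bbf", "\u0bc0", "\u0bc1", "\u0bc2", "\u0bc6",
               "\u0bc7", "\u0bc8", "\u0bca", "\u0bcb", "\u0bcc"]  # ா ... ௌ

BASE_RANK = {ch: i for i, ch in enumerate(VOWELS + [AYTHAM] + CONSONANTS + GRANTHA)}
# Pure consonant first, then the bare letter (inherent 'a'), then the vowel signs.
MARK_RANK = {PULLI: 0}
INHERENT_A = 1
MARK_RANK.update({sign: i + 2 for i, sign in enumerate(VOWEL_SIGNS)})

def tamil_sort_key(word):
    """Return a list of (group, base rank, mark rank) tuples, one per syllable."""
    text = unicodedata.normalize("NFC", word)
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANT_SET:
            if nxt in MARK_RANK:
                # Consonant followed by pulli or a vowel sign: one syllable.
                key.append((0, BASE_RANK[ch], MARK_RANK[nxt]))
                i += 2
                continue
            key.append((0, BASE_RANK[ch], INHERENT_A))
        elif ch in BASE_RANK:
            key.append((0, BASE_RANK[ch], 0))  # vowel or ஃ
        else:
            key.append((1, ord(ch), 0))  # anything else: after Tamil, by code point
        i += 1
    return key

words = ["கணமொழி", "கணமூலி", "கணம்புல்", "கணம்"]
print(sorted(words, key=tamil_sort_key))
```

The key is built per syllable rather than per code point, which is what lets ம் sort before ம and மா even though the pulli has a higher code point than the vowel signs.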
Comment 5 Sundar 2014-11-16 03:06:37 UTC
(In reply to elan from comment #4)
> Following are the two other related issues that I would like to be added to
> this bug report.
> 
> 1) The sort position of letter ஃ should be after all the vowels. Currently,
> it is positioned like  ஃ, அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ. The right
> order is அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ, ஃ. It should be noted that,
> upon sorting all the Tamil letters, ஃ should appear after ஔ and before க்.
> 
> 2) The Consonant letters should appear on top of their compounding forms. If
> we sort the letters (ம, ம், மா ), the right result is ( ம், ம, மா ).
> Currently the order is (ம, மா, ம் ) which is wrong. The impact of this can
> be understood by sorting a few strings. Given the set of 4 strings as
> (கணமொழி, கணமூலி, கணம்புல், கணம்), current sort order results into (கணமூலி,
> கணமொழி, கணம், கணம்புல்). This is wrong and the right order is (கணம்,
> கணம்புல், கணமூலி, கணமொழி). (These 4 strings are proper Tamil words according
> to Tamil lexicon @ http://dsalsrv02.uchicago.edu/dictionaries/tamil-lex/ .

Yes, the above sequence is the correct order.
