Last modified: 2013-10-29 05:09:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T31495, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 29495 - Numbering system grouping for Indian languages


Summary:	Numbering system grouping for Indian languages

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	Internationalization (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement with 1 vote (vote)
Target Milestone:	---
Assigned To:	Santhosh Thottingal

URL:
Whiteboard:
Keywords:	i18n

Depends on:
Blocks:	40760 56295

Reported:	2011-06-20 05:17 UTC by praveenp
Modified:	2013-10-29 05:09 UTC (History)
CC List:	15 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description praveenp 2011-06-20 05:17:57 UTC

MediaWiki gives 3-digit grouping for numbers by default (1234567890 → 1,234,567,890). All Indian languages and many other Asian languages use a different way of grouping (1234567890 → 1,23,45,67,890). Magic words such as {{NUMBEROFARTICLES}} also give pre-formatted counts with 3-digit grouping. Languages like Malayalam, Hindi, Tamil, Kannada, Bengali, etc. need easily readable and understandable formatting in the traditional Indian grouping style.

Please see: http://en.wikipedia.org/wiki/Indian_numbering_system , http://ml.wikipedia.org/wiki/Special:Statistics (example of MediaWiki's default grouping)

Comment 1 Brion Vibber 2011-06-20 17:46:59 UTC

I could have sworn this had already been done, but can't find it or a bug for it. :)

This'll require either a customized commafy() method on the Language subclass, or a way of triggering different behavior from a setting from the Message file (similar to the way the digit transform table can be specified there).

Questions:

* Should this grouping *always* be used, for all numbers? Are there exceptions for certain lengths or certain types of numbers? (Years in dates usually are not run through commafy() so won't have this applied.)

* Should this grouping *always* be used, regardless of whether using indic or western style digits? (see bug 29279)

* Would there be any controversy or conflict over making this change for any particular languages?

* Is there a complete list of which languages this should apply to?

Comment 2 Mark A. Hershberger 2011-06-20 22:40:55 UTC

From http://en.wikipedia.org/wiki/Decimal_mark#Countries_using_Arabic_numerals_with_decimal_comma

> In India, due to a numeral system using lakhs (lacs) (1,00,000 equal to
> 100 000) and crores (1,00,00,000 equal to 10 000 000), comma is used at
> levels of thousand, lakh and crore, for example, 10 million (1 crore)
> would be written as 1,00,00,000.

So it looks like it doesn't apply to Devanagari digits.  But that is just a guess.

Comment 3 Shiju Alex 2011-06-21 10:25:43 UTC

//So it looks like it doesn't apply to Devanagari digits.  But that is just a
guess.//


It applies to all Indic language numerals (Devanagari/Kannada/Bengali/Odia/...). A few languages like Malayalam, Tamil and Telugu use Indo-Arabic numerals. The above enhancement is necessary for them too.

Comment 4 praveenp 2011-06-21 13:16:17 UTC

(In reply to comment #1)

> * Should this grouping *always* be used, for all numbers? Are there exceptions
> for certain lengths or certain types of numbers? (Years in dates usually are
> not run through commafy() so won't have this applied.)

Yes, years in dates should not be grouped. :)


> * Should this grouping *always* be used, regardless of whether using indic or
> western style digits? (see bug 29279)

This grouping should always be used, regardless of the style of digits. For example, Malayalam's own digits are archaic now, but Malayalam still uses Indian-style grouping with the borrowed digits. Likewise, the words crore and lakh are derived directly from the Hindi words Karode and lakh.

> * Would there be any controversy or conflict over making this change for any
> particular languages?

So far, this kind of grouping is the only style popular in India. Even English newspapers use the words crore and lakh and the 3,2,2,.. grouping. This kind of grouping is probably easily readable and understandable.

> * Is there a complete list of which languages this should apply to?

Currently Malayalam (ml), Hindi (hi), Sanskrit (sa), Tamil (ta), Kannada (kn), Telugu (te), Marathi (mr), Urdu (ur), Oriya (or), Bengali (bn), Punjabi (pa), Gujarati (gu), Bhojpuri (bho), Assamese (as) and Kashmiri (ks) are okay with this grouping, with the exception of dates, both with their own digits and with English (1,2,3,..,0) digits.

There may be more languages, such as Sinhalese (si), Burmese (my), Farsi (fa), Dhivehi (dv) and many other South East Asian languages, which use the same style of numbering system.

Comment 5 Brion Vibber 2011-06-21 22:21:05 UTC

Ok since that's used for so many languages, probably easiest to do it in the base Language::commafy() triggered by a setting, that way we won't have to add extra classes just to duplicate the same alternate layout. :)

Maybe options like:

$digitGrouping = '1k';        // 1,000    10,000    1,000,000 (default)
$digitGrouping = '10k';       //  1000    10,000    1,000,000
$digitGrouping = 'indic';     // 1,000    10,000    10,00,000
$digitGrouping = 'none';      //  1000     10000      1000000

The default behavior would be covered by '1k' mode, adding the thousands separator at every 3 digits.
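
As a rough illustration of the four proposed modes, here is a minimal sketch in Python rather than MediaWiki's PHP. The function name, the `mode` parameter and the restriction to non-negative integers are assumptions made for the example; this is not the code that was eventually committed.

```python
def commafy(number: int, mode: str = "1k", sep: str = ",") -> str:
    """Group the digits of a non-negative integer according to `mode`.

    Hypothetical helper mirroring the proposed '1k', '10k', 'indic'
    and 'none' settings; the names are illustrative only.
    """
    if mode not in ("1k", "10k", "indic", "none"):
        raise ValueError(f"unknown grouping mode: {mode}")
    digits = str(number)
    # 'none' never groups; '10k' leaves numbers below 10,000 alone.
    if mode == "none" or (mode == "10k" and len(digits) < 5):
        return digits
    # The rightmost group always has three digits. In 'indic' mode
    # every group further left has two digits, otherwise three.
    later = 2 if mode == "indic" else 3
    groups = [digits[-3:]]
    rest = digits[:-3]
    while rest:
        groups.insert(0, rest[-later:])
        rest = rest[:-later]
    return sep.join(groups)


for mode in ("1k", "10k", "indic", "none"):
    print(mode, commafy(1000, mode), commafy(10000, mode), commafy(1000000, mode))
```

The loop at the end prints the same three sample numbers that appear in the comments next to each option above.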

Current languages to switch from manual commafy() overrides to using '10k' mode, skipping the separator until reaching 10,000 (mostly Eastern European and some Central Asian languages):
* be_tarask
* bg
* et
* hy
* kaa
* kk_cyrl
* ksh
* ku_ku
* pl
* ru
* uk

Current languages to switch from manual commafy() to 'none' mode:
* km (see [[Khmer_numerals]])
* my (see [[Burmese_numerals]])

I wasn't sure if those non-conversions were right; Khmer and Burmese both use South-East Asian Indic script variants, but neither appears to use standard digit grouping per the examples at the pages above, so I believe that is indeed correct, though Burmese is listed as sometimes using the crore/lakh grouping at [[Indian_numbering_system]].

Comment 6 Mayur 2011-06-22 16:54:07 UTC

I agree with Shiju and praveen on this issue: all Indian languages and many other Asian languages use a different way of grouping (1234567890 → 1,23,45,67,890). I think this system should be applied universally for all Indic wikis, because all Indic wikis use the same format.

Regards
mayur

Comment 7 Santhosh Thottingal 2011-08-31 10:13:51 UTC

(In reply to comment #5)
> Ok since that's used for so many languages, probably easiest to do it in the
> base Language::commafy() triggered by a setting, that way we won't have to add
> extra classes just to duplicate the same alternate layout. :)
> 
> Maybe options like:
> 
> $digitGrouping = '1k';        // 1,000    10,000    1,000,000 (default)
> $digitGrouping = '10k';       //  1000    10,000    1,000,000
> $digitGrouping = 'indic';     // 1,000    10,000    10,00,000
> $digitGrouping = 'none';      //  1000     10000      1000000

A better way to specify these options in a more generic way is to follow the LC_NUMERIC grouping property format of Glibc locale definitions.

"grouping keyword consists of a sequence of semicolon-separated integers. Each integer specifies the number of digits in a group. The initial integer defines the size of the group immediately to the left of the decimal delimiter. The following integers define succeeding groups to the left of the previous group. If the last integer is not -1, the size of the previous group (if any) is used repeatedly for the remainder of the digits. If the last integer is -1, no further grouping is performed."

3;-1      123456,789
3         123,456,789    (this is the en_US default format)
3;2;-1    1234,56,789
3;2       12,34,56,789   (this is Indic)
-1        123456789      (equivalent to 'none')

This can cover any complex formatting requirements.
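
The grouping rule quoted above can be sketched as follows. This is a minimal Python illustration that assumes the input is a plain string of integer digits; the function name `group_digits` is hypothetical and the code is not taken from Glibc or MediaWiki.

```python
def group_digits(digits: str, grouping: str, sep: str = ",") -> str:
    """Apply an LC_NUMERIC-style grouping string such as "3;2" to `digits`."""
    sizes = [int(part) for part in grouping.split(";")]
    groups = []   # collected from right to left
    rest = digits
    size = 0
    index = 0
    while rest:
        if index < len(sizes):
            if sizes[index] == -1:
                break   # -1: no further grouping is performed
            size = sizes[index]
            index += 1
        # Once the list is exhausted, the last size is reused.
        if size <= 0 or len(rest) <= size:
            break
        groups.append(rest[-size:])
        rest = rest[:-size]
    groups.append(rest)
    return sep.join(reversed(groups))


for spec in ("3;-1", "3", "3;2;-1", "3;2", "-1"):
    print(spec, group_digits("123456789", spec))
```

The loop prints one line per grouping string from the list above, so the output can be compared against those examples directly.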

Comment 8 Siebrand Mazeland 2011-08-31 10:22:01 UTC

(In reply to comment #7)

> A better way to specify these options in a more generic way is to follow the
> LC_NUMERIC grouping property format of Glibc locale definitions.
>
> <snip>
>
> This can cover any complex formatting requirements.

Can you create a patch or implement this, Santhosh?

Comment 9 Akshay Agarwal 2011-09-01 17:59:53 UTC

Santhosh's solution can be implemented by modifying the specific language localization file in Glibc. For example, if we wanted to fix this issue for Hindi Wikipedia, then we would edit the hi_IN localization file & modify the LC_NUMERIC field value to 3;2 , recompile PHP & set the current locale of Hindi Wikipedia to hi_IN

An alternate solution is to use PHP's built-in NumberFormatter class & specify the language-specific formatting as the pattern http://www.php.net/manual/en/numberformatter.setpattern.php
http://www.icu-project.org/apiref/icu4c/classDecimalFormat.html#_details
This class offers a wide range of options for formatting & can be implemented with just a few lines of code. 

Continuing with example of Hindi Wikipedia, this can be done as

Add to LocalSettings.php
$wgNumberPattern = "##,##,###";

Modify the commafy() in Language.php
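function commafy( $_ ) {
	global $wgNumberPattern;
	// Passing "0" makes setlocale() return the current locale without changing it.
	$currentLocale = setlocale( LC_NUMERIC, "0" );
	$numberFormat = new NumberFormatter( $currentLocale, NumberFormatter::DEFAULT_STYLE );
	// ICU pattern "##,##,###": the last group has 3 digits, earlier groups have 2.
	$numberFormat->setPattern( $wgNumberPattern );
	return $numberFormat->format( $_ );
}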

Comment 10 Niklas Laxström 2011-09-11 14:00:13 UTC

We can do this in MediaWiki itself. Yes it's duplication, but we can't yet require PHP 5.3 nor wait for PHP to be patched.

Comment 11 Santhosh Thottingal 2011-09-22 07:20:46 UTC

r97793 adds the required feature to support number grouping pattern. Will add the pattern in Message Classes soon.

Comment 12 Santhosh Thottingal 2011-09-22 09:24:31 UTC

##,##,### pattern added to ml, hi, pa, gu, or, bn, as, te, ta, kn, mr languages in r97804
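
To illustrate how a pattern such as ##,##,### can drive the grouping, here is a minimal Python sketch. It assumes the ICU-style reading of the pattern (the size of the last group comes from the text after the final comma, and the size of all earlier groups from the text between the last two commas) and handles non-negative integers only. The function name is hypothetical and this is not the code from r97793 or r97804.

```python
def format_with_pattern(number: int, pattern: str = "##,##,###", sep: str = ",") -> str:
    """Group the digits of a non-negative integer using a '#'-and-comma pattern."""
    digits = str(number)
    parts = pattern.split(",")
    primary = len(parts[-1])
    # A pattern without a comma (or with nothing after it) means no grouping.
    if len(parts) == 1 or primary == 0:
        return digits
    # With two or more commas the second-to-last field sets the size of
    # every earlier group; with a single comma the primary size repeats.
    secondary = len(parts[-2]) if len(parts) >= 3 else primary
    if secondary == 0:
        secondary = primary
    groups = [digits[-primary:]]
    rest = digits[:-primary]
    while rest:
        groups.insert(0, rest[-secondary:])
        rest = rest[:-secondary]
    return sep.join(groups)


print(format_with_pattern(1234567890))        # Indian-style grouping
print(format_with_pattern(1234567, "#,###"))  # plain 3-digit grouping
```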
