Last modified: 2012-10-04 09:53:04 UTC
There are four Bangla (also known as Bengali) letters that combine with a NUKTA (U+09BC) and take on a different meaning or pronunciation: Ra (U+09B0), Rra (U+09DC), Rha (U+09DD) and Yya (U+09DF).

Ra  U+09B0 came from U+09AC + U+09BC
Rra U+09DC came from U+09A1 + U+09BC
Rha U+09DD came from U+09A2 + U+09BC
Yya U+09DF came from U+09AF + U+09BC

This is what the Unicode Consortium says, because they did no research on Bangla and simply followed ISCII. Anyway, to the point: Wikipedia pages behave strangely after correct input. If I write U+09DC, it automatically becomes U+09A1 + U+09BC after saving. Likewise, if I write U+09DD it becomes U+09A2 + U+09BC, and U+09DF becomes U+09AF + U+09BC. Fortunately, U+09B0 does not have this problem.

Now you need to fix the issue by reversing the mapping. If I type U+09B0, it should stay as it is, and if I type U+09AC + U+09BC, it should automatically become U+09B0. As I said, U+09B0 has no problem, but you also need to define U+09AC + U+09BC = U+09B0. Likewise, U+09DC should stay as it is, and if anyone enters U+09A1 + U+09BC, it should become U+09DC after saving; U+09DD should stay as it is, and U+09A2 + U+09BC should become U+09DD after saving; U+09DF should stay as it is, and U+09AF + U+09BC should become U+09DF after saving. So please sort out the issue ASAP.

-- Omi Azad
Contributor, Bangla Computing and Localization Projects
Ankur: http://www.ankurbangla.org
Ekushey: http://www.ekushey.org
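For reference, the claimed origins can be checked against what the Unicode Character Database actually records. A minimal sketch using Python's standard unicodedata module (note that this reflects the published UCD, which records no decomposition for Ra U+09B0, only for the three nukta letters):

```python
import unicodedata

# Canonical decompositions recorded in the Unicode Character Database
# for the four letters discussed above. The UCD records a decomposition
# only for the three nukta letters; Ra (U+09B0) has none.
for cp in (0x09B0, 0x09DC, 0x09DD, 0x09DF):
    ch = chr(cp)
    print("U+%04X %-22s -> %r"
          % (cp, unicodedata.name(ch), unicodedata.decomposition(ch)))
```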
This is a serious issue and would affect searches for articles, as the articles are automatically mistitled: instead of one Unicode character, the aforementioned characters are split into two characters. So anyone searching for an article title involving the above characters is unable to find it, using either the bn-wiki's built-in search or Google search. -- Ragib, Administrator, bn-wiki
Unicode normalization is applied to all input, including both edits and search text, so this should work consistently in that respect. If there's a bug in the Unicode definitions, I'm afraid you'll need to take it up with Unicode to get it fixed consistently...
Well, I told you that the UTC is full of people illiterate in Indic scripts, which is why they have so many problems; even if I raise an issue with them, they don't understand what to do. :) Sir, it's absolutely your problem. We use thousands of pieces of software with UTF-8 encoding, from both the open and closed source worlds, and none of them has this problem. If I write U+09DD in OpenOffice, it never becomes U+09A2 + U+09BC; the same goes for MS Office, and even Gedit or Notepad. So what should I think? The UTC made a mistake in writing the definitions of these characters in http://www.unicode.org/charts/PDF/U0980.pdf and you followed it. Can you show me any reference on the UTC site that makes you think the current behavior is okay?
The rendering is a serious problem. Almost all other websites render the above Unicode characters correctly. For example, please check the following page from the BBC Bengali service's website (written in Unicode Bangla):

http://www.bbc.co.uk/bengali/news/story/2005/08/050831_mknizami.shtml

Find the following word: রয়েছে

Now here is the same word when I write it on Wikipedia (English or Bangla): রয়েছে

More specifically, look at the following character:

য় : from BBC Bengali's site
য় : from Wikipedia

You can check that the second example is not the intended letter yya; rather, it is the juxtaposition of two letters, ja and nukta (য + ়). This normalization is totally incorrect and is breaking searches for the affected text. The correct Unicode is also used by almost all Bangla websites (I gave the example of the BBC's Bengali service), so I don't see why Wikipedia should render it incorrectly and thus make the articles unreachable from search engines. This bug is a serious one and needs to be fixed immediately.

Thanks
Ragib
Admin, Bangla Wikipedia
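The search failure described here can be reproduced in a few lines. This sketch (Python's standard unicodedata module; my own illustration, not the wiki's code) shows that the two spellings only match once both sides are normalized:

```python
import unicodedata

precomposed = "\u09df"        # YYA as used on BBC Bengali and most sites
decomposed = "\u09af\u09bc"   # JA + NUKTA, what the wiki stores

# A raw string comparison (what a naive search does) fails to match:
print(precomposed == decomposed)            # False

# Normalizing both sides first makes them compare equal, because NFC
# leaves U+09DF decomposed (it is composition-excluded):
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(precomposed) == nfc(decomposed))  # True
```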
I'd also like to draw your attention to Google's Bangla localized page at http://www.google.com.bd. Look at the text: ভাষা সম্পর্কিত হাতিয়ারসমূহ Specifically, in the word হাতিয়ারসমূহ you will find the character য়. This is rendered correctly: Google is using the correct Unicode code point for yya, NOT the incorrect juxtaposition of ja and nukta. I can give many other examples, but I guess you understand the issue by now. Bangla typing systems, documents, and everything else have already corrected this issue, as has Mozilla in their localized Firefox builds. I see no reason to continue this incorrect behavior in MediaWiki. It will hurt the Bangla Wikipedia a lot, as articles will become unreachable from search engines: people looking for a page will not type the incorrect code, nor will Google or anything else do the redundant mapping to the incorrect code pairs. Thanks, Ragib
I have been working with Unicode, Microsoft and other organizations on Bangla issues since 2000, so I know what I'm saying. I asked Brion Vibber to show me any reference he has. I bet he cannot, and this is indeed a MediaWiki problem.
There are exactly two possibilities:

1) Our implementation of Unicode normalization is correct to spec.
2) Our implementation of Unicode normalization is incorrect and does not follow spec.

If you can show that 2) is true, it's my problem and I'll be happy to fix it. However, you indicate that 1) is the case. In that case you'll need to take it up with the Unicode Consortium to either get the UCD corrected or have new characters added with more appropriate normalization characteristics. Similar breakage will occur in all other applications that follow W3C recommendations to normalize input to form C, making it very much Unicode's problem if it's wrong.
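The "normalize all input" behavior being defended here can be sketched in a few lines (the store() helper is hypothetical, for illustration only; it is not MediaWiki's actual code):

```python
import unicodedata

def store(text):
    # Hypothetical helper: normalize all input to form C before saving,
    # so edits and search queries are compared in the same form.
    return unicodedata.normalize("NFC", text)

# An edit typed with precomposed RRA (U+09DC) and a query typed as
# DDA + NUKTA become identical once both pass through the same
# normalization -- NFC keeps U+09DC decomposed (composition exclusion):
assert store("\u09dc") == store("\u09a1\u09bc") == "\u09a1\u09bc"
```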
Brion Brother, you didn't understand me clearly. You said in #1 that "our implementation of Unicode normalization is correct to spec", but I asked you to show me a document containing your correct specification. If you cannot show that, then it automatically falls under #2 and you have to fix it. Bro, it's not a UTC problem; it's your problem. As Ragib provided some links to Bangla texts above, you can check them out. I understand that you followed the Additional Consonants section of http://www.unicode.org/charts/PDF/U0980.pdf. They didn't tell you to base your normalization on that reference; the reference is there to show you how things are. So please try to sort it out ASAP.
http://www.unicode.org/reports/tr15/ http://www.unicode.org/ucd/
Bro, you didn't understand me, or maybe I completely missed the track. You are doing what the UTC says in http://www.unicode.org/reports/tr15/, section "Table 2: String Concatenation". That is only for the case where you type U+09AC + U+09BC, U+09A1 + U+09BC, U+09A2 + U+09BC or U+09AF + U+09BC. They didn't tell you to follow the same rule if you directly type U+09B0, U+09DC, U+09DD or U+09DF; you don't need to re-encode those according to any rule. That is not a rule at all. Let me try to explain the whole thing once again: if I type U+09B0, U+09DC, U+09DD or U+09DF, you don't need to apply any rule to them. But if I type U+09AC + U+09BC, U+09A1 + U+09BC, U+09A2 + U+09BC or U+09AF + U+09BC, you can apply any normalization rule to them, and that is what the UTC is saying. In your case, when I type U+09DC it becomes U+09A1 + U+09BC, which is very wrong. Please double-check your reference documents; they didn't ask you to do anything like that. I hope you understand now...
Marking as: Bug 5948 blocks: Bug 3985: character conversion (tracking)
Let's see. The Unicode character database is:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

The entry for 09DC is:

  09DC;BENGALI LETTER RRA;Lo;0;L;09A1 09BC;;;;N;;;;;

That shows a canonical decomposition to 09A1 (BENGALI LETTER DDA) followed by 09BC (BENGALI SIGN NUKTA).

We then check the composition exclusion table:
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt

Here we find an entry excluding it from being produced by canonical composition:

  09DC # BENGALI LETTER RRA

Thus the normalized canonical composition (NFC) will remain decomposed, as 09A1 09BC.

Further, we can check the entry for this character in the normalization test suite:
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Here we can see that 09DC normalizes the same way in all four forms:

  09DC;09A1 09BC;09A1 09BC;09A1 09BC;09A1 09BC; # (ড়; ড◌়; ড◌়; ড◌়; ড◌়; ) BENGALI LETTER RRA

I can also confirm that Python's Unicode normalization implementation produces the same output:

  >>> import unicodedata
  >>> unicodedata.normalize("NFC", u"\u09dc")
  u'\u09a1\u09bc'

Case closed. If you don't like the normalization rules, talk to Unicode. If you find browsers with incorrect search systems, file a bug with them. If you find search engines with incorrect search systems, file a bug with them.
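The NormalizationTest.txt line quoted above can be checked the same way. A small extension of the Python example (my own sketch) verifies that U+09DC comes out decomposed in all four normalization forms:

```python
import unicodedata

# NormalizationTest.txt says U+09DC maps to U+09A1 U+09BC in NFC, NFD,
# NFKC and NFKD alike; the composition exclusion means NFC/NFKC never
# re-compose it.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, "\u09dc") == "\u09a1\u09bc"
```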
>Let's see, the Unicode character database is:
>http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
>The entry for 09DC is:
>09DC;BENGALI LETTER RRA;Lo;0;L;09A1 09BC;;;;N;;;;;

So leave 09DC as it is. Why are you normalizing it?

>That shows a canonical decomposition to 09A1 (BENGALI LETTER DDA) followed
>by 09BC (BENGALI SIGN NUKTA).
>
>We then check the composition exclusion table:
>http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt
>
>Here we find an entry excluding it from being produced by canonical
>composition:
>09DC # BENGALI LETTER RRA
>
>Thus the normalized canonical composition (NFC) will remain decomposed, as
>09A1 09BC.

Normalization is only required when I type ড followed by ়; if I type ড়, it should remain the same.

>Further we can check the entry for this character in the normalization
>test suite:
>http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
>
>Here we can see that 09DC normalizes the same way in all four forms:
>09DC;09A1 09BC;09A1 09BC;09A1 09BC;09A1 09BC; # (ড়; ড◌়; ড◌়; ড◌়; ড◌়; ) BENGALI LETTER RRA

Again I say the same thing: you need to apply normalization only to sequences like ড ়, ঢ ়, য ় and ব ় (this last one is not mentioned by them), but you applied the rule to all cases.

>I can confirm also that Python's Unicode normalization implementation
>produces the same output:
>>>> import unicodedata
>>>> unicodedata.normalize("NFC", u"\u09dc")
>u'\u09a1\u09bc'
>
>Case closed.

The UTC is not doing anything wrong. Why are you changing an independent character into a character sequence? Check the documents carefully; the UTC didn't tell you to *change it* anywhere. Please understand the issue.

>If you don't like the normalization rules, talk to Unicode.
>
>If you find browsers with incorrect search systems, file a bug with them.
>
>If you find search engines with incorrect search systems, file a bug with
>them.

Very silly answer.
So you think that *only* you are moving with perfection and the whole world is wrong? Unicode is wrong, browsers are wrong, search engines are wrong, Microsoft is wrong, Sun is wrong, IBM is wrong, Mozilla is wrong? :) You are arguing over unnecessary points and trying not to understand the whole thing. Thousands of pieces of software work fine except yours. If you behave like this and don't try to understand the facts, Wiki will become Week-i to the Bangla-speaking community. If you are not satisfied with my points, try consulting the UTC. In the meantime, I'll show this bug to my UTC contacts and I hope they'll shed some light on this issue. Finally, you are misunderstanding the whole point along with the UTC's documentation.
When Brion's defence is based on "this is how it is done in Python", then it is in Python that this bug needs fixing. If that is so, this bug can be closed again. It is similar to an issue in the Dutch language: there, the ij is invariably written as an "i" and a "j". However, the glyph kids learn in school is not this combination. I know that MediaWiki does not have this behaviour; the ij is not changed into its two "parts", it stays as it is. Thanks, GerardM
The Python example just shows that Python correctly implements the Unicode recommendation. Please reread comment 12, which explains why MediaWiki respects the normalization rule. Retagging as LATER. File a bug at unicode.org.
I've slapped up some notes at http://www.mediawiki.org/wiki/Unicode_normalization_considerations
[Quoting from http://www.mediawiki.org/wiki/Unicode_normalization_considerations]

* a surprising composition exclusion in Bangla
  o The result doesn't render right with some tools, probably again a platform-specific bug
  o Some third-party search tools apparently don't know how to normalize and fail to locate texts so normalized.

The rendering and third-party search problems are annoying, though if we stay on our high horse we can try to ignore them and let the other parties fix their broken software over time. The canonical ordering problems are a harder issue; you simply can't get these right by following the current specs. Unicode won't change the ordering definitions because it would break their compatibility rules, so unless they introduce *new* characters with the correct values... well, it's not clear this is going to happen.

[/quote]

I think I have failed to make you understand the problem. Also, I don't understand why you are applying normalization rules in your software at all. There are thousands of websites and millions of web pages currently in Bangla, and the web pages themselves never apply any rule to the characters; the characters always remain as they are. The MediaWiki software is changing the character into a sequence and calling it normalization.

Let me give you a short example so that you can understand more clearly. If you type Â, it remains like that; it never becomes something like A^. But in Bangla, when I type য়, it becomes য়. As I said before, it's not a problem at our end; it's your problem. Whenever I save my text, it should remain exactly as it is. If any rendering is needed, the rendering engine should be responsible for it, like the Uniscribe engine on Windows or Pango/Qt on Linux. So it would be better if you removed all the normalization rules from your end and left it to the application side.
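The Â vs. য় comparison in the comment above is exactly the composition-exclusion difference. A short check (Python's unicodedata; my own illustration) makes it concrete:

```python
import unicodedata

# U+00C2 (A WITH CIRCUMFLEX) is not composition-excluded, so NFC
# re-composes A + combining circumflex back into the single character:
assert unicodedata.normalize("NFC", "A\u0302") == "\u00c2"

# U+09DF (YYA) is on the composition-exclusion list, so NFC leaves it
# (and keeps it) as JA + NUKTA:
assert unicodedata.normalize("NFC", "\u09df") == "\u09af\u09bc"
```

So both characters are being run through the same NFC rule; they differ only in what the Unicode data tables say NFC should produce for each.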
Omi, do you have difficulty reading the things I've written? I ask this not to be rude, but because your responses don't appear to display any comprehension of any of the following: * The reasons given for why normalization is done * The reasons given for why the result is 100% correct implementation of specs (though the specs might not be to your liking) * The fact that I understand the problems with third party software that this causes * The fact that I am willing to accommodate the issue and made some recommendations on how to do this I'm not going to waste any more time discussing this issue with you if you're this incapable of following the discussion. If you still care about this issue, please ask someone who is able to follow an argument, read and understand documentation, and reason with others to continue instead of you.
After doing extensive R&D, we found we just need to fix the fonts; then everything will be sorted. Microsoft has come up with their solution, and soon we'll apply the same fix to other fonts for Linux and OS X. The issue is sorted. Update your fonts and you'll find everything works perfectly.
Changing all WONTFIX high priority bugs to lowest priority (no mail should be generated since I turned it off for this.)
If I understand this report correctly, it turned out to be a font issue. So I am marking this as FIXED. If this is inaccurate then please REOPEN it.