Last modified: 2012-10-04 09:53:04 UTC
There are four Bangla (also known as Bengali) letters that combine with a NUKTA (U+09BC) and take on a different meaning or pronunciation: Ra (U+09B0), Rra (U+09DC), Rha (U+09DD) and Yya (U+09DF).

Ra  U+09B0 came from U+09AC + U+09BC
Rra U+09DC came from U+09A1 + U+09BC
Rha U+09DD came from U+09A2 + U+09BC
Yya U+09DF came from U+09AF + U+09BC

This is what the Unicode Consortium says, because they did no research on Bangla and simply followed ISCII. Anyway, to the point: Wikipedia pages behave strangely after correct input. If I write U+09DC, it automatically becomes U+09A1 + U+09BC after saving. Likewise, if I write U+09DD it becomes U+09A2 + U+09BC, and U+09DF becomes U+09AF + U+09BC. Fortunately, U+09B0 does not have this problem.

Now you need to fix the issue by reversing the mapping. If I type U+09B0, it should stay as it is, and if I type U+09AC + U+09BC, it should automatically become U+09B0. As I said, U+09B0 has no problem, but you also need to define U+09AC + U+09BC = U+09B0. Likewise, U+09DC should stay as it is, and if anyone enters U+09A1 + U+09BC, it should become U+09DC after saving; U+09DD should stay as it is, and U+09A2 + U+09BC should become U+09DD after saving; U+09DF should stay as it is, and U+09AF + U+09BC should become U+09DF after saving. So please sort out the issue ASAP.

-- Omi Azad
Contributor, Bangla Computing and Localization Projects
Ankur: http://www.ankurbangla.org
Ekushey: http://www.ekushey.org
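For reference, the claimed origins can be checked against what the Unicode Character Database actually records. A minimal sketch using Python's standard unicodedata module (note that this reflects the published UCD, which records no decomposition for Ra U+09B0, only for the three nukta letters):

```python
import unicodedata

# Canonical decompositions recorded in the Unicode Character Database
# for the four letters discussed above. The UCD records a decomposition
# only for the three nukta letters; Ra (U+09B0) has none.
for cp in (0x09B0, 0x09DC, 0x09DD, 0x09DF):
    ch = chr(cp)
    print("U+%04X %-22s -> %r"
          % (cp, unicodedata.name(ch), unicodedata.decomposition(ch)))
```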
This is a serious issue and would affect searches for articles, as the articles are automatically mistitled: instead of one Unicode character, the aforementioned characters are split into two characters. So anyone searching for an article title involving the above characters is unable to find it, using either the bn-wiki's built-in search or Google search. -- Ragib, Administrator, bn-wiki
Unicode normalization is applied to all input, including both edits and search text, so this should work consistently in that respect. If there's a bug in the Unicode definitions, I'm afraid you'll need to take it up with Unicode to get it fixed consistently...
Well, I told you that the UTC is full of people illiterate in Indic scripts, which is why they have so many problems; even if I raise an issue with them, they don't understand what to do. :) Sir, it's absolutely your problem. We use thousands of pieces of software with UTF-8 encoding, from both the open and closed source worlds, and none of them has this problem. If I write U+09DD in OpenOffice, it never becomes U+09A2 + U+09BC; the same goes for MS Office, and even Gedit or Notepad. So what should I think? The UTC made a mistake in writing the definitions of these characters in http://www.unicode.org/charts/PDF/U0980.pdf and you followed it. Can you show me any reference on the UTC site that makes you think the current behavior is okay?
The rendering is a serious problem. Almost all other websites render the above Unicode characters correctly. For example, please check the following page from the BBC Bengali service's website (written in Unicode Bangla):

http://www.bbc.co.uk/bengali/news/story/2005/08/050831_mknizami.shtml

Find the following word: রয়েছে

Now here is the same word when I write it on Wikipedia (English or Bangla): রয়েছে

More specifically, look at the following character:

য় : from BBC Bengali's site
য় : from Wikipedia

You can check that the second example is not the intended letter yya; rather, it is the juxtaposition of two letters, ja and nukta (য + ়). This normalization is totally incorrect and is breaking searches for the affected text. The correct Unicode is also used by almost all Bangla websites (I gave the example of the BBC's Bengali service), so I don't see why Wikipedia should render it incorrectly and thus make the articles unreachable from search engines. This bug is a serious one and needs to be fixed immediately.

Thanks
Ragib
Admin, Bangla Wikipedia
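The search failure described here can be reproduced in a few lines. This sketch (Python's standard unicodedata module; my own illustration, not the wiki's code) shows that the two spellings only match once both sides are normalized:

```python
import unicodedata

precomposed = "\u09df"        # YYA as used on BBC Bengali and most sites
decomposed = "\u09af\u09bc"   # JA + NUKTA, what the wiki stores

# A raw string comparison (what a naive search does) fails to match:
print(precomposed == decomposed)            # False

# Normalizing both sides first makes them compare equal, because NFC
# leaves U+09DF decomposed (it is composition-excluded):
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(precomposed) == nfc(decomposed))  # True
```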
I'd also like to draw your attention to Google's Bangla localized page at http://www.google.com.bd. Look at the text: ভাষা সম্পর্কিত হাতিয়ারসমূহ Specifically, in the word হাতিয়ারসমূহ you will find the character য়. This is rendered correctly: Google is using the correct Unicode code point for yya, NOT the incorrect juxtaposition of ja and nukta. I can give many other examples, but I guess you understand the issue by now. Bangla typing systems, documents, and everything else have already corrected this issue, as has Mozilla in their localized Firefox builds. I see no reason to continue this incorrect behavior in MediaWiki. It will hurt the Bangla Wikipedia a lot, as articles will become unreachable from search engines: people looking for a page will not type the incorrect code, nor will Google or anything else do the redundant mapping to the incorrect code pairs. Thanks, Ragib
I have been working with Unicode, Microsoft and other organizations on Bangla issues since 2000, so I know what I'm saying. I asked Brion Vibber to show me any reference he has. I bet he cannot, and this is indeed a MediaWiki problem.
There are exactly two possibilities:

1) Our implementation of Unicode normalization is correct to spec.
2) Our implementation of Unicode normalization is incorrect and does not follow spec.

If you can show that 2) is true, it's my problem and I'll be happy to fix it. However, you indicate that 1) is the case. In that case you'll need to take it up with the Unicode Consortium to either get the UCD corrected or have new characters added with more appropriate normalization characteristics. Similar breakage will occur in all other applications that follow W3C recommendations to normalize input to form C, making it very much Unicode's problem if it's wrong.
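The "normalize all input" behavior being defended here can be sketched in a few lines (the store() helper is hypothetical, for illustration only; it is not MediaWiki's actual code):

```python
import unicodedata

def store(text):
    # Hypothetical helper: normalize all input to form C before saving,
    # so edits and search queries are compared in the same form.
    return unicodedata.normalize("NFC", text)

# An edit typed with precomposed RRA (U+09DC) and a query typed as
# DDA + NUKTA become identical once both pass through the same
# normalization -- NFC keeps U+09DC decomposed (composition exclusion):
assert store("\u09dc") == store("\u09a1\u09bc") == "\u09a1\u09bc"
```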
Brion Brother, you didn't understand me clearly. You said in #1 that "our implementation of Unicode normalization is correct to spec", but I asked you to show me a document containing your correct specification. If you cannot show that, then it automatically falls under #2 and you have to fix it. Bro, it's not a UTC problem; it's your problem. As Ragib provided some links to Bangla texts above, you can check them out. I understand that you followed the Additional Consonants section of http://www.unicode.org/charts/PDF/U0980.pdf. They didn't tell you to base your normalization on that reference; the reference is there to show you how things are. So please try to sort it out ASAP.
http://www.unicode.org/reports/tr15/ http://www.unicode.org/ucd/
Bro, you didn't understand me, or maybe I completely missed the track. You are doing what the UTC says in http://www.unicode.org/reports/tr15/, section "Table 2: String Concatenation". That is only for the case where you type U+09AC + U+09BC, U+09A1 + U+09BC, U+09A2 + U+09BC or U+09AF + U+09BC. They didn't tell you to follow the same rule if you directly type U+09B0, U+09DC, U+09DD or U+09DF; you don't need to re-encode those according to any rule. That is not a rule at all. Let me try to explain the whole thing once again: if I type U+09B0, U+09DC, U+09DD or U+09DF, you don't need to apply any rule to them. But if I type U+09AC + U+09BC, U+09A1 + U+09BC, U+09A2 + U+09BC or U+09AF + U+09BC, you can apply any normalization rule to them, and that is what the UTC is saying. In your case, when I type U+09DC it becomes U+09A1 + U+09BC, which is very wrong. Please double-check your reference documents; they didn't ask you to do anything like that. I hope you understand now...
Marking as: Bug 5948 blocks: Bug 3985: character conversion (tracking)
Let's see. The Unicode character database is:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

The entry for 09DC is:

  09DC;BENGALI LETTER RRA;Lo;0;L;09A1 09BC;;;;N;;;;;

That shows a canonical decomposition to 09A1 (BENGALI LETTER DDA) followed by 09BC (BENGALI SIGN NUKTA).

We then check the composition exclusion table:
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt

Here we find an entry excluding it from being produced by canonical composition:

  09DC # BENGALI LETTER RRA

Thus the normalized canonical composition (NFC) will remain decomposed, as 09A1 09BC.

Further, we can check the entry for this character in the normalization test suite:
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Here we can see that 09DC normalizes the same way in all four forms:

  09DC;09A1 09BC;09A1 09BC;09A1 09BC;09A1 09BC; # (ড়; ড◌়; ড◌়; ড◌়; ড◌়; ) BENGALI LETTER RRA

I can also confirm that Python's Unicode normalization implementation produces the same output:

  >>> import unicodedata
  >>> unicodedata.normalize("NFC", u"\u09dc")
  u'\u09a1\u09bc'

Case closed. If you don't like the normalization rules, talk to Unicode. If you find browsers with incorrect search systems, file a bug with them. If you find search engines with incorrect search systems, file a bug with them.
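The NormalizationTest.txt line quoted above can be checked the same way. A small extension of the Python example (my own sketch) verifies that U+09DC comes out decomposed in all four normalization forms:

```python
import unicodedata

# NormalizationTest.txt says U+09DC maps to U+09A1 U+09BC in NFC, NFD,
# NFKC and NFKD alike; the composition exclusion means NFC/NFKC never
# re-compose it.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, "\u09dc") == "\u09a1\u09bc"
```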
>Let's see, the Unicode character database is:
>http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
>The entry for 09DC is:
>09DC;BENGALI LETTER RRA;Lo;0;L;09A1 09BC;;;;N;;;;;

So leave 09DC as it is. Why are you normalizing it?

>That shows a canonical decomposition to 09A1 (BENGALI LETTER DDA) followed
>by 09BC (BENGALI SIGN NUKTA).
>
>We then check the composition exclusion table:
>http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt
>
>Here we find an entry excluding it from being produced by canonical
>composition:
>09DC # BENGALI LETTER RRA
>
>Thus the normalized canonical composition (NFC) will remain decomposed, as
>09A1 09BC.

Normalization is only required when I type ড followed by ়; if I type ড়, it should remain the same.

>Further we can check the entry for this character in the normalization
>test suite:
>http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
>
>Here we can see that 09DC normalizes the same way in all four forms:
>09DC;09A1 09BC;09A1 09BC;09A1 09BC;09A1 09BC; # (ড়; ড◌়; ড◌়; ড◌়; ড◌়; ) BENGALI LETTER RRA

Again I say the same thing: you need to apply normalization only to sequences like ড ়, ঢ ়, য ় and ব ় (this last one is not mentioned by them), but you applied the rule to all cases.

>I can confirm also that Python's Unicode normalization implementation
>produces the same output:
>>>> import unicodedata
>>>> unicodedata.normalize("NFC", u"\u09dc")
>u'\u09a1\u09bc'
>
>Case closed.

The UTC is not doing anything wrong. Why are you changing an independent character into a character sequence? Check the documents carefully; the UTC didn't tell you to *change it* anywhere. Please understand the issue.

>If you don't like the normalization rules, talk to Unicode.
>
>If you find browsers with incorrect search systems, file a bug with them.
>
>If you find search engines with incorrect search systems, file a bug with
>them.

Very silly answer.
So you think that *only* you are moving with perfection and the whole world is wrong? Unicode is wrong, browsers are wrong, search engines are wrong, Microsoft is wrong, Sun is wrong, IBM is wrong, Mozilla is wrong? :) You are arguing over unnecessary points and trying not to understand the whole thing. Thousands of pieces of software work fine except yours. If you behave like this and don't try to understand the facts, Wiki will become Week-i to the Bangla-speaking community. If you are not satisfied with my points, try consulting the UTC. In the meantime, I'll show this bug to my UTC contacts and I hope they'll shed some light on this issue. Finally, you are misunderstanding the whole point along with the UTC's documentation.
When Brion's defence is based on "this is how it is done in Python", then it is in Python that this bug needs fixing. If that is so, this bug can be closed again. It is similar to an issue in the Dutch language: there, the ij is invariably written as an "i" and a "j". However, the glyph kids learn in school is not this combination. I know that MediaWiki does not have this behaviour; the ij is not changed into its two "parts", it stays as it is. Thanks, GerardM
The Python example just shows that Python correctly implements the Unicode recommendation. Please reread comment 12, which explains why MediaWiki respects the normalization rule. Retagging as LATER. File a bug at unicode.org.
I've slapped up some notes at http://www.mediawiki.org/wiki/Unicode_normalization_considerations
[Quoting from http://www.mediawiki.org/wiki/Unicode_normalization_considerations]

* a surprising composition exclusion in Bangla
  o The result doesn't render right with some tools, probably again a platform-specific bug
  o Some third-party search tools apparently don't know how to normalize and fail to locate texts so normalized.

The rendering and third-party search problems are annoying, though if we stay on our high horse we can try to ignore them and let the other parties fix their broken software over time. The canonical ordering problems are a harder issue; you simply can't get these right by following the current specs. Unicode won't change the ordering definitions because it would break their compatibility rules, so unless they introduce *new* characters with the correct values... well, it's not clear this is going to happen.

[/quote]

I think I have failed to make you understand the problem. Also, I don't understand why you are applying normalization rules in your software at all. There are thousands of websites and millions of web pages currently in Bangla, and the web pages themselves never apply any rule to the characters; the characters always remain as they are. The MediaWiki software is changing the character into a sequence and calling it normalization.

Let me give you a short example so that you can understand more clearly. If you type Â, it remains like that; it never becomes something like A^. But in Bangla, when I type য়, it becomes য়. As I said before, it's not a problem at our end; it's your problem. Whenever I save my text, it should remain exactly as it is. If any rendering is needed, the rendering engine should be responsible for it, like the Uniscribe engine on Windows or Pango/Qt on Linux. So it would be better if you removed all the normalization rules from your end and left it to the application side.
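The Â vs. য় comparison in the comment above is exactly the composition-exclusion difference. A short check (Python's unicodedata; my own illustration) makes it concrete:

```python
import unicodedata

# U+00C2 (A WITH CIRCUMFLEX) is not composition-excluded, so NFC
# re-composes A + combining circumflex back into the single character:
assert unicodedata.normalize("NFC", "A\u0302") == "\u00c2"

# U+09DF (YYA) is on the composition-exclusion list, so NFC leaves it
# (and keeps it) as JA + NUKTA:
assert unicodedata.normalize("NFC", "\u09df") == "\u09af\u09bc"
```

So both characters are being run through the same NFC rule; they differ only in what the Unicode data tables say NFC should produce for each.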
Omi, do you have difficulty reading the things I've written? I ask this not to be rude, but because your responses don't appear to display any comprehension of any of the following: * The reasons given for why normalization is done * The reasons given for why the result is 100% correct implementation of specs (though the specs might not be to your liking) * The fact that I understand the problems with third party software that this causes * The fact that I am willing to accommodate the issue and made some recommendations on how to do this I'm not going to waste any more time discussing this issue with you if you're this incapable of following the discussion. If you still care about this issue, please ask someone who is able to follow an argument, read and understand documentation, and reason with others to continue instead of you.
After doing extensive R&D, we found we just need to fix the fonts; then everything will be sorted. Microsoft has come up with their solution, and soon we'll apply the same fix to other fonts for Linux and OS X. The issue is sorted. Update your fonts and you'll find everything works perfectly.
Changing all WONTFIX high priority bugs to lowest priority (no mail should be generated since I turned it off for this.)
If I understand this report correctly, it turned out to be a font issue. So I am marking this as FIXED. If this is inaccurate then please REOPEN it.