Last modified: 2014-11-17 10:16:58 UTC
I can't find an existing bug report, but there is discussion of the Hebrew case here: http://en.wikipedia.org/wiki/Wikipedia:Niqqud The problem seems to have been known for some time. We are now noticing a similar problem with Arabic on Wiktionary. There is some discussion here: http://en.wiktionary.org/wiki/Talk:%D8%AC%D8%AF%D8%A7
The bug, as I noticed it, is caused by the special characters used for vowels, dagesh, right and left shin dots, etc. not being sorted properly by the wiki software, probably because they are not being recognized as RTL. Many free texts in Hebrew are quite ancient and depend on niqqud to be read properly, so fixing this bug should take high priority, IMHO.
Input text is checked for valid UTF-8 and normalized to Unicode Normalization Form C (canonical composed form). Someone needs to provide:
* Short and exact before-and-after examples
* If possible, a comparison against other Unicode normalization implementations, to show whether we're performing normalization incorrectly
If there is an error in my normalization implementation, and it can be narrowed down, I'd be happy to fix it. If this is the result of the correct normalization algorithm, I'm not really sure what to do.
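For reference, the step being described can be sketched in a few lines of Python using the standard unicodedata module. This is only an illustration under stated assumptions: the function name clean_input is hypothetical, and the real MediaWiki code is PHP, not this.

    import unicodedata

    def clean_input(raw_bytes):
        # Raises UnicodeDecodeError if the input is not valid UTF-8.
        text = raw_bytes.decode("utf-8")
        # Normalization Form C: canonical decomposition, canonical
        # ordering of combining marks, then canonical recomposition.
        return unicodedata.normalize("NFC", text)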
For a typical before-and-after example, see the following comparison of versions: http://he.wikisource.org/w/index.php?title=%D7%90%D7%92%D7%A8%D7%AA_%D7%94%D7%A8%D7%9E%D7%91%22%D7%9F&diff=2794&oldid=1503 In that example, the only change actually made by the user was adding a category at the end, but when the text was saved, the order of vowels was altered in most of the words in the text. If what Brion means is an example of a single word or something like that, it will be hard to provide examples, because only texts contributed up to December show "before" examples. However, maybe this will help: when vowelized text from word processors like Word and OpenOffice is pasted into wiki edit boxes, the vowels are automatically changed to the wrong positions in the wiki coding.
Dovi, what browser are you using, and which version of it? Which operating system? Looking at the diff that you provided and checking the first few lines, those look OK to me. All the letters are identical on the right and on the left.
Comparing with Brion's laptop (he uses Mac OS 10.4, I use 10.3.9), the letters differ between his machine and mine. There are dots in some of Brion's letters where I don't see any.
(I was testing in Safari and JeLuF in Firefox. They may render differently, or we may have been using different fonts...) Yes, I would very much like to get individual words. You can copy them out of the Wikipedia pages if you like. Very helpful for each of these would be:
* The 'before' formatting, saved in a UTF-8 text file (Notepad on Windows XP is OK for this)
* The 'after' formatting, saved in a UTF-8 text file
* A detailed, close-up rendering of what it's supposed to look like (a screenshot of 'before' correctly rendered, at a large enough font size that I can tell the difference)
* A detailed, close-up rendering of what it ends up looking like
If possible, also a description of which bits have moved or changed and how this affects the reading of the text.
Created attachment 751 [details] A .txt file in UTF-8
I’m using IE 6 on Win2K Professional, and I’ve been seeing this problem as well. Texts that I created a year or so ago in Arabic are fine, but if I now open and re-save them (using all of the same software as before), Arabic vowel pairs become reversed. I can provide you here with some examples, one with the vowels together, and another separating the vowels with a tashdid (baseline) ... then you can remove the tashdid and bring the vowels together to see what happens. (Tahoma would be a good font to see this.)
1. This pair is supposed to look like a little superscript w with an '''over'''line: سسّـَس سسَّس (if you get an '''under'''lined w, it’s reversed).
2. This pair is supposed to look like a little superscript w with an '''under'''line: سسّـِس سسِّس (if the underline is below the entire '''word''' rather than below the little '''w''', it’s reversed).
3. This pair is supposed to look like a little superscript w with a '''double over'''line: سسّـًا سسًّا (if you get a w with a double '''under'''line, it’s reversed).
4. This pair is supposed to look like a little superscript w with a '''double under'''line: سسّـٍا سسٍّا (if the double underline is below the entire word rather than below the little w, it’s reversed).
5. This pair is supposed to look like a little superscript w with a comma above it: سسّـُس سسُّس (if the comma is '''in''' the w rather than above it, it’s reversed).
6. This pair is supposed to look like a little superscript w with a '''fancy''' comma above it: سسّـٌا سسٌّا (if the fancy comma is '''in''' the w rather than above it, it’s reversed).
As I am looking at this note '''before''' I save it, everything on my screen appears correct. After I save it, all six examples will be reversed. You can insert spaces in the examples to separate the vowels, and you should find that they have become the reverse of the control examples with tashdids (baselines) in them.
I just now sent the above message (comment #8) concerning Arabic vowel pairs, and I see that all of the vowel pairs are correct. Clearly, the Bugzilla software is different from the en.wiktionary.org software. If you copy my examples from the above message into a Wiktionary page, you will see how they become reversed.
Here's the given string broken into groups of base and combining characters:

d7 91  U+05D1 HEBREW LETTER BET
d6 bc  U+05BC HEBREW POINT DAGESH OR MAPIQ  < in normalized string, this
d6 b7  U+05B7 HEBREW POINT PATAH            < sequence is swapped
d7 99  U+05D9 HEBREW LETTER YOD
d6 b0  U+05B0 HEBREW POINT SHEVA
d7 91  U+05D1 HEBREW LETTER BET
d6 bc  U+05BC HEBREW POINT DAGESH OR MAPIQ  < in normalized string, this
d6 b7  U+05B7 HEBREW POINT PATAH            < sequence is swapped
d7 a8  U+05E8 HEBREW LETTER RESH
d6 b0  U+05B0 HEBREW POINT SHEVA
d7 a1  U+05E1 HEBREW LETTER SAMEKH

The only change in the normalized string is that each dagesh+patah combining sequence is re-ordered into patah+dagesh.

I've tried displaying the before and after texts in Internet Explorer 6.0 (Windows XP), in Firefox Deer Park Alpha 2 (Mac OS X 10.4.2), and in Safari 2.0 (Mac OS X 10.4.2). The two strings appear the same, even zoomed in, on IE/Win and Firefox/Mac. In Safari the dots are positioned slightly differently. I do not know whether this slight difference is relevant or 'real'.

Python program to confirm that another implementation gives the same results:

    from unicodedata import normalize
    before = u"\u05d1\u05bc\u05b7\u05d9\u05b0\u05d1\u05bc\u05b7\u05e8\u05b0\u05e1"
    after = u"\u05d1\u05b7\u05bc\u05d9\u05b0\u05d1\u05b7\u05bc\u05e8\u05b0\u05e1"
    coded = normalize("NFC", before)
    if (coded == before) or (coded != after):
        print "something is broken"
    else:
        print "as expected"
Created attachment 754 [details] Strings from attachment 1 [details] displaying identically in IE 6.0 on Windows XP Professional SP2
Created attachment 755 [details] Highlighted display difference in Safari on Mac OS X 10.4.2 The dots show slightly displaced in Safari 2.0 on Mac OS X 10.4.2 in the normalized text. Is that movement (from the black dot location to the red dot location) significant? They *do not* display differently in Firefox Deer Park Alpha 2 on the same machine. Both string forms display identically on that browser and OS. They *do not* display differently in Internet Explorer 6.0 on Windows XP Professional SP2. Both string forms display identically on that browser and OS.
The problem is only (I think) on Windows 98 and Windows XP prior to SP2.
I’ve been requesting a fix for the incorrect Arabic normalization (compound vowels) for months, but Arabic still cannot be entered and saved properly in en.wiktionary articles, and I have never received a reply to my requests. I don’t know if I haven’t made myself clear, if no one has had the time, or if no one thinks I know what I’m talking about. I use Firefox 1.0.7 and also IE 6 on Win2K Pro. It makes no difference which browser I use: I cannot save Arabic files correctly in en.wiktionary... nor can anyone else, apparently, because whenever somebody opens an old Arabic article to make some small change, the vowels become incorrectly reversed upon saving. I’ve been typesetting Arabic professionally since the 1970s and I know how it’s supposed to be written. If you need examples, either here or on en.wiktionary, I can easily provide them. In short, the current normalization produces the wrong results with all compound vowels: shadda+fatha, shadda+kasra, shadda+damma, and shadda+fathatan, shadda+kasratan, shadda+dammatan. In the following examples, (A) = correct and (X) = wrong:
(A) عصَّا ; (X) عصَّا
(A) عصِّا ; (X) عصِّا
(A) عصُّا ; (X) عصُّا
(A) عصًّا ; (X) عصًّا
(A) عصٍّا ; (X) عصٍّا
(A) عصٌّا ; (X) عصٌّا
Under the current normalization, if anyone opens a page containing (A), it will become (X) when he saves it (even if he makes no changes). One example is http://en.wiktionary.org/wiki/حسن , which was written with all the correct vowels prior to the implementation of normalization (and which appeared correctly), but has since had to have some of its vowels removed because of this serious problem. I will be happy to explain further if anyone needs clarification.
What I need is a demonstration of incorrect normalization. This is a Unicode standard and, as far as I have been able to test, everything is running according to the standard. Pretty much every current XML-based recommendation, file-format standard, and protocol these days recommends the use of Unicode Normalization Form C, which is what we're using. If this breaks Arabic and Hebrew, then a lot of other software is going to break it in the same way.

If there's a difference in rendering, is it:
* A bug in the renderer?
* An operating system bug? (old versions of Windows)
* An application bug? (browser etc.)
* A bug in the normalization implementation?
* A bug in the normalization rules that Unicode defines?
* A bug in the Unicode data files?
* A corrupt copy of the Unicode data files?

The impression I've been given is that it's a bug in old versions of Windows and that things render correctly on Windows XP. Can you confirm or refute this?

Can you make a clear, supportable claim that a particular normalized character sequence is incorrectly formed? If so, how should it be formed? Is the correct formation normalized or not? If not, why not? If so, why isn't it what we get from normalizing the input? Is there an automatic transformation we can do on output? If so, what? If there is, should we do so? What are the complications that can arise?

Or perhaps the error is in the arrangement of the original input? Where does the input come from and what arranges it? Is it arranged correctly? If not, how should it be arranged? How can it be arranged? Is there an automatic transformation we can do on input? If so, what? If there is, should we do so? What are the complications that can arise?

On these questions I've gotten a lot of nothing. The closest has been an example of a string in 'before' and 'after' states, which appears to render identically on Windows... so what's the problem?
I can confirm that the bug has been fixed for Hebrew in Service Pack 2 of Windows XP, but not in earlier versions. If this is the case for Arabic as well, which our Arabic-reading members can check, then we should probably add to the main he.wiki pages, and the equivalent Arabic ones, an explanation of the problem with a recommendation to upgrade to that OS and service pack.
Created attachment 978 [details] Correct rendering of the string "Bibi" with fixed-width font
Created attachment 979 [details] Incorrect rendering of the string "Bibi" with fixed-width font Screenshot taken in the wiki editor box after pressing 'Show preview'.
If indeed the Unicode normalization rules imply the switching of the DAGESH and the PATAH (as demonstrated in comment #10), then I suppose it's a bug in the renderer. As for the way things _should_ be: it is completely insignificant for a user which way the symbols are stored. In Hebrew (manual) writing it makes no difference whether the DAGESH is written down before the PATAH or vice versa. When typing text on a computer (at least on Windows), the text is displayed and stored correctly only if the DAGESH is entered first. I don't have the tools here to examine the way it is stored internally, but it is nevertheless rendered correctly every time. This is not the case on the wiki. Once the procedure switches the two symbols, the DAGESH is displayed _outside_ of the BET, an obvious misrendering (see attachments id=978, id=979). I have experienced this bug on Windows 2000 as well as Windows XP with IE 6.0.x. I believe this should be considered a significant bug, as these are highly popular environments. Moreover, in Hebrew (and Arabic), vowel marks are used mostly in scripture, poetry, and transliteration of foreign words and names. Many wiki pages (especially on Wikisource) contain such texts. The bug renders such text hard to read and is _very_ apparent to any user who tries to read these texts (and very annoying for me, as I am currently writing about China and constantly need to transliterate Chinese names).
(In reply to comment #19, by Ariel Steiner) Ariel, did you experience this bug in Win XP with Service Pack 2? I use that, and I see Hebrew with nikkud on the wiki perfectly. Others have reported this bug to exist in Win XP with SP1 but not with SP2, so I assume it has been fixed in the latter service pack.
I experienced the bug on both WinXP (no SP2) and Win2K, both with IE6 and Firefox 1.0.7. I don't see why a user should have to upgrade from Win2K (or Me) to WinXP SP2 just because of a nikkud problem.
I'd like to add to Ariel's comments that nikkud works perfectly well in various fonts and on all platforms in word processors: Word for Windows and OpenOffice. Why should MediaWiki be any different? Don't the word processors also use Unicode? Dovi
Dovi, typical word processors probably aren't applying canonical normalization to text.

OK, I spent some time googling around trying to find more background on this. Basically there seem to be two distinct issues:

1) The normalization rules order some nikkud combinations differently from what the font renderer in old versions of Windows expects. This is a bug in either Windows or the font. From all indications that have been given to me, this is fixed in the current version of Windows (XP Service Pack 2).

2) In some rarer cases, appearing in at least Biblical Hebrew, actual semantic information may be lost by the application of normalization. This is a bug in the Unicode standard, but it's already established. Some day they may figure out a proper workaround.

As for 1), my inclination is to recommend that you upgrade if it's bothering you. Turning off normalization in general would open us up to various weird data corruption, confusing hard-to-reach duplicate pages, easier malicious name spoofing, etc. If Microsoft has already fixed the bug in their product, great. Use the fixed version or try a competing OS. It might be possible to add a postprocessing step to re-order output to what old buggy versions of Windows expect, but this sounds error-prone.

As for 2), it's not clear to me whether this is just a phantom problem that _might_ break something or whether it's actually breaking text. (Most stuff is probably affected by problem 1.) There's not much we can do about this if it happens, other than turning off normalization (and all that entails).

Background links:
http://www.unicode.org/faq/normalization.html#8
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html
http://lists.ibiblio.org/pipermail/biblical-languages/2003-July/000763.html
Does anybody know if the Windows bugs were in the fonts, in Uniscribe, or in both? Can the new Uniscribe handle the old fonts, for instance? If all or part of the problem was with the fonts, then what about third-party fonts not under Microsoft's control? Also, has Microsoft issued any kind of fix for OSes other than XP? Has anybody tested this on any Unix or Linux platform? How does Pango handle this? Without knowing the answers to all these questions, I would lean toward a user option to perform a post-normalization compatibility re-ordering.
Hello! [[en:Wikipedia_talk:Niqqud#Precombined_characters_-_NON-precombined_characters]] relays some notes received from http://mysite.verizon.net/jialpert/YidText/YiddishOnWeb.htm : "Recommendations for Displaying Yiddish Text on Web Pages". Depending on platform, browser, characters (and fonts?), one may experience some of the problems mentioned there. That page suggests using "precombined characters" as the preferred output and "postponing" non-precombined characters until later days. Consequence: Wikimedia projects should provide at least some notes about the problem (affected platforms / browsers / what to do / how to configure / upgrade to ...). Regards, Reinhardt [[user:gangleri]]
Please see also bug 3885: title normalisation
I've tried to find what causes the problem, and I've located it. The problem is in UtfNormal::fastCombiningSort, in the file phase3/includes/normal/UtfNormal.php. It sorts the nikud according to the numbers in $utfCombiningClass (defined in phase3/includes/normal/UtfNormalData.inc). This array, unserialized, is shown in [[he:Project:ניקוד#איפיון הבאג]], in the <pre>. You can see that Dagesh is 21 and Patah is 17, so they are re-ordered: instead of Dagesh+Patah, we get Patah+Dagesh. But the order SHOULD be first Dagesh, then Patah, because that's their correct order; so it's a bug in MediaWiki that we re-order them. In WinXP SP2 they are shown correctly because of a *workaround* (it's not a bugfix there, only a workaround for these mistakes), but their stored order is still wrong. Maybe in Vista they won't use this workaround.

The question is: what does this function (UtfNormal::fastCombiningSort) do? What's its purpose? Why should it sort the nikud, or anything else? It's already sorted well. How is it related to the normalization? Is there any documentation about it?

You can just delete the nikud from the array $utfCombiningClass if you want to see the function leave them alone.

I am changing the summary, because that's exactly the bug. I am also changing the OS and Hardware fields, because the bug is not only there: the final display problem appears there, but the underlying problem exists everywhere. Thank you very much, and please answer the questions in the second paragraph, so that we will be able to fix this bug.
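The combining-class values quoted above can be verified against an independent copy of the Unicode data, for example with Python's unicodedata module (a quick cross-check, not MediaWiki code):

    import unicodedata

    dagesh, patah = u"\u05BC", u"\u05B7"
    print(unicodedata.combining(dagesh))   # 21 (HEBREW POINT DAGESH OR MAPIQ)
    print(unicodedata.combining(patah))    # 17 (HEBREW POINT PATAH)
    # Canonical ordering sorts adjacent marks by combining class,
    # so dagesh+patah becomes patah+dagesh under NFC:
    bet = u"\u05D1"
    print(unicodedata.normalize("NFC", bet + dagesh + patah)
          == bet + patah + dagesh)         # True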
(In reply to comment #27)
> This array, unserialized, is shown in [[he:Project:ניקוד#איפיון הבאג]],
> in the <pre>.

Now it's shown in [[User:Rotemliss/Nikud]].
Rotem, this function implements a Unicode standard. The bug is in the standard. Until some future version of Unicode "fixes" this, I'm just going to mark this bug as LATER.
I've slapped up some notes at http://www.mediawiki.org/wiki/Unicode_normalization_considerations
I for one totally support the suggested solution, namely "Remove the normalization check" etc. That would be ideal for the Hebrew Wikipedia, since its guidelines strictly forbid the use of nikkud (vowel markers) in its titles, i.e., there are no composed letters in document titles. Separating the title and display title would also be very convenient, because it would allow easy searching on one hand and the use of nikkud in the display title where appropriate on the other.
Incidentally, this is not a "bug" in the Unicode Standard, and won't be fixed later in that standard. The entire issue of canonical ordering of "fixed position" class combining marks for Hebrew has been debated extensively on the Unicode forums, but the outcome isn't about to change, because of the requirements for stability of normalization.

The problem is in people's interpretation of the *intent* of canonical ordering in the Unicode Standard. (See The Unicode Standard, 5.0, p. 115.) "The canonical order of character sequences does *not* imply any kind of linguistic correctness or linguistic preference for ordering of combining marks in sequences." In effect, the Unicode Standard is agnostic about the input order or linguistically preferred order of dagesh+patah (or patah+dagesh). What normalization (and canonical ordering) *do* imply, however, is that the two sequences are to be interpreted as equivalent.

It sounds to me like MediaWiki is implementing Unicode normalization correctly. The bug, if anything, is in the *rendering* of the sequences, as implied by some of the earlier comments on this. dagesh+patah and patah+dagesh should render identically; there is no intent that they stack in some different way dependent on their ordering when rendered. The original intent of the fixed position combining classes in the standard was that they applied to combining marks whose *positions were fixed*: in other words, the dagesh goes where the dagesh is supposed to go, and the patah goes where the patah is supposed to go, regardless of which order they were entered or stored.

Also, it should be noted that the Unicode Standard does not impose any requirement that Unicode text be stored in normalized form. Wikimedia is free to normalize or not, depending on its needs and contexts. Normalization to NFC in most contexts is probably a good idea, however, as it simplifies comparisons, sorts, and searches. But as in this particular case for Hebrew, you can run into issues in the display of normalized text if your rendering system and/or fonts are not quite up to snuff regarding the placement of sequences of marks for pointed Hebrew text.

--Ken Whistler, Unicode 5.0 editor
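The equivalence point above is easy to demonstrate with any conforming implementation. For example, in Python (an illustration only), both input orders normalize to the identical string, which is why a conforming renderer must draw them identically:

    import unicodedata

    bet, dagesh, patah = u"\u05D1", u"\u05BC", u"\u05B7"
    a = bet + dagesh + patah   # dagesh entered first
    b = bet + patah + dagesh   # patah entered first
    # Canonically equivalent: both normalize to the same NFC string.
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True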
Hebrew vowelization seems much improved in Firefox 3. It would be nice to know exactly what changed and how, and to have these things documented in case there are future problems. Firefox 3 seems to correctly represent the vowel order for webpages in general and Wikimedia pages in particular. The only anomaly I found nevertheless is that pasting vowelized text into the edit page shows only partial vowelization. On the "saved" wiki page it appears correctly.
(In reply to comment #33)
> Hebrew vowelization seems much improved in Firefox 3. It would be nice to
> know exactly what changed and how, and to have these things documented in
> case there are future problems.
>
> Firefox 3 seems to correctly represent the vowel order for webpages in
> general and Wikimedia pages in particular.
>
> The only anomaly I found nevertheless is that pasting vowelized text into
> the edit page shows only partial vowelization. On the "saved" wiki page it
> appears correctly.

The bug of showing the dagesh and other vowels in the wrong order usually depends on the operating system. For example, Windows XP (possibly only with Service Pack 2) displays it well, while older Windows systems don't. However, Firefox 3.0 did fix some Hebrew vowel bugs, like the problem with nikud in justified text (see https://bugzilla.mozilla.org/show_bug.cgi?id=60546 ).
*** Bug 14834 has been marked as a duplicate of this bug. ***
Since this bug also affects Myanmar in exactly the same way, could the title be appended with Myanmar as well? Normalization is not taking place the way it should. Here is the sort sequence as it should be, as specified in Unicode Technical Note #11:

Name            Specification
Consonant       [U+1000 .. U+102A, U+103F, U+104E]
Asat3           U+103A
Stacked         U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y        U+103B
Medial R        U+103C
Medial W        U+103D
Medial H        U+103E
E vowel         U+1031
Upper Vowel     [U+102D, U+102E, U+1032]
Lower Vowel     [U+102F, U+1030]
A Vowel         [U+102B, U+102C]
Anusvara        U+1036
Visible virama  U+103A
Lower Dot       U+1037
Visarga         U+1038

I can provide more technical detail if needed. Hence U+1037 should always come after U+103A (even though U+103A is 'higher'). And U+1032 should come _before_ U+102F, U+1030, U+102B, U+102C, and so on. I notice that this bug is related more to Unicode normalization than to MediaWiki itself. But an important question I have is: *can* the Unicode normalization check be disabled for the Myanmar Wikipedia while we try to resolve it? That would be very helpful.
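The conflict between the UTN#11 sequence and canonical ordering can be reproduced with any conforming normalizer, for example in Python (a small check; KA, U+1000, stands in for an arbitrary consonant):

    import unicodedata

    # UTN#11 order: visible virama (asat, U+103A) before lower dot (U+1037)
    s = u"\u1000\u103A\u1037"
    print(u" ".join(u"U+%04X" % ord(c) for c in unicodedata.normalize("NFC", s)))
    # -> U+1000 U+1037 U+103A: canonical ordering (ccc 7 < ccc 9) puts the
    #    lower dot first, the opposite of the UTN#11 sequence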
(In reply to comment #36)
> Since this bug also affects Myanmar in exactly the same way, could the
> title be appended with Myanmar as well?

You can do things like that yourself here.

> But an important question I have is: *can* the Unicode normalization check
> be disabled for the Myanmar Wikipedia while we try to resolve it?

See [[mw:Unicode normalization concerns]]. This is feasible. We could turn off normalization for article text and leave it for titles, which would allow DISPLAYTITLE to be used to work around ugly display in titles. However, it would require some work.
I would prefer to keep normalization, as there are benefits from it: it enforces a particular sequence. My question now is what kind of data I should provide to Brion Vibber so that he can implement the normalization for Myanmar. Our case is quite different from Hebrew and is more straightforward. I believe UTN#11 v2 would be sufficient? It was updated recently for Unicode 5.1. I would like to wait a while before actually thinking of disabling normalization for article text and using the workaround for titles. If it can be implemented, we won't need to turn off normalization, and we would benefit from it. Thanks.
It would almost certainly be a bad idea to use different normalization for a single wiki. This would create complications when trying to, for instance, import pages. If this is genuinely an issue for Myanmar, we should fix it in the core software for all MediaWiki wikis that contain any Myanmar text. Same for Hebrew and Arabic. What exactly is the issue here? Some user agents render theoretically equivalent sequences of code points differently, so normalization changes display? Which user agents are these?
Created attachment 5078 [details] Relative Order (Normalization?) for Unicode 5.1 Myanmar
Created attachment 5079 [details] Relative Order (Normalization?) for pre-Unicode 5.1/Myanmar
I have attached two images. The first one shows the normalization sequence for 5.1, and the second one shows the normalization sequence for pre-Unicode 5.1. They are drastically different. A copy of both can be found here: http://unicode.org/notes/tn11/myanmar_uni-v2.pdf (page 4 for the latest, page 9 for the deprecated one). The normalization done by MediaWiki seems to be the pre-5.1 one. I am adding the pre-5.1 table here:

Name            Specification
kinzi           U+1004 U+1039
Consonant       [U+1000 .. U+102A]
Stacked         U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y        U+1039 U+101A
Medial R        U+1039 U+101B
Medial W        U+1039 U+101D
Medial H        U+1039 U+101F
E vowel         U+1031
Lower Vowel     [U+102F, U+1030]
Upper Vowel     [U+102D, U+102E, U+1032]
A Vowel         U+102C
Anusvara        U+1036
Visible virama  U+1039 U+200C
Lower Dot       U+1037
Visarga         U+1038

Yes, normalization changes display. I have attached a JPEG file showing the error caused, here: https://bugzilla.wikimedia.org/show_bug.cgi?id=14834
Created attachment 5080 [details] Contents of includes/normal/UtfNormalData.inc As far as I can tell, MediaWiki is indeed using the 5.1 tables. I've attached the data used for normalization, which is generated by a script that downloads the appropriate files from http://www.unicode.org/Public/5.1.0/ucd/. If you can spot an error, please say what it is. You might want to talk to Tim Starling, since as far as I can tell he's the one who wrote this.
U+1037 is int(7) and U+103A is int(9); does this mean that U+1037 will always be put first? This seems so similar to the patah-dagesh issue. :( This is the relevant section of $utfCombiningClass:

["့"]=> int(7)
["္"]=> int(9)
["်"]=> int(9)

The order given here does not seem to be the same as the order given in UTN#11. I guess this is a lesson not to take UTNs too seriously. I do like the sort order as it is on Wikipedia; it's just having problems with fonts. And I am a bit surprised that the data in the UCD does not match what was authored in the UTN. So as far as MediaWiki is concerned, it's just like the situation with Hebrew. We will now need to move over to the Unicode mailing list and ask what's going on. Simetrical, many thanks for clearing this one up for me. :) As a side note, the developer of the Parabaik font gave me this link: http://ngwestar.googlepages.com/padaukvsmyanmar3 I noticed that the sequence mentioned there was recently changed.
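The same values can be confirmed against Python's copy of the Unicode character database, independently of MediaWiki's PHP data files:

    import unicodedata

    for cp in (u"\u1037", u"\u1039", u"\u103A"):
        print("U+%04X ccc=%d" % (ord(cp), unicodedata.combining(cp)))
    # U+1037 ccc=7, U+1039 ccc=9, U+103A ccc=9,
    # matching the $utfCombiningClass entries above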
Found something which should not have been re-sequenced.

Input:  U+101E U+1004 U+103A U+1039 U+1001 U+103B U+102C
Output: U+101E U+1001 U+103B U+102C U+1004 U+103A U+1039

The output is wrong because U+1004 is a consonant and U+1001 is also a consonant, hence MediaWiki should not have swapped them, if my understanding of Unicode normalization is correct. My understanding is that the sorting starts over whenever a new consonant starts, because that is the beginning of a new syllable cluster. No font will be able to render the output from MediaWiki.
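For what it's worth, an independent implementation agrees that nothing here should move: Python's unicodedata leaves this input unchanged, because U+1001 has combining class 0 and canonical reordering never moves marks across a base character. This is a cross-check only, not proof of where the bug lives; if MediaWiki produced the output above, the fault would be in its implementation rather than in the standard.

    import unicodedata

    s = u"\u101E\u1004\u103A\u1039\u1001\u103B\u102C"
    # True: a conforming normalizer does not reorder marks across
    # the base letter U+1001 (combining class 0)
    print(unicodedata.normalize("NFC", s) == s)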
I suggest you e-mail Tim Starling.
I am adding here that the issue with Myanmar Unicode (Lower Dot and Visible Virama) is an issue that will be covered in the revision to UTN#11, as an oversight in the standards review process. Due to the stability criteria of UnicodeData.txt, there is nothing we can do about this. This is not a MediaWiki bug; since many people are now linking here to point this out as a bug, I need to clarify that. This sadly does mean that fonts and IMEs will need to update; meanwhile MediaWiki 1.4 will have the problem mentioned here, and the way to resolve it is simply to wait for updated fonts and IMEs. The advantages of turning off normalization far outweigh the disadvantages. If there are plans to adopt a less invasive normalization process, as mentioned in Normalization Concerns, then the issue can be resolved. The developers of fonts and IMEs have agreed to update, so those running MediaWiki installations might want to keep normalization on. The second issue, with kinzi (comment #45), seems to be resolved now. Was MediaWiki updated between July and now?
FYI: https://bugzilla.wikimedia.org/show_activity.cgi?id=2399 I did not change priorities; I only added myself as CC. It seems that the Priority field is gone.
Marking REOPENED. The standard has been updated since 2006. We discussed this at the Berlin Hackathon.
See another demonstration of this problem here: http://en.wikisource.org/wiki/User:Amire80/Havrakha
Assigning to me so we can look over the current state and see about fixing it up.
Apparently, you have not implemented the contractions and expansions of the UCA. Note that there has been NO change in Unicode 5.1 (or later) to normalization, which has been stabilized since at least Unicode 4.0.1. The bugs above are most probably not related to normalization, if it is implemented correctly (and normalization is an easy problem that can be implemented very efficiently). And the changes in the DUCET (or now the CLDR DUCET) do not affect how Hebrew, Arabic, or Myanmar is sorted within the same script.

You should also learn to separate the Unicode Normalization Algorithm (UNA), the Unicode Collation Algorithm (UCA), and the Unicode Bidi Algorithm (UBA), because the Bidi algorithm only affects the display, but definitely NOT the other two. And the order produced by normalization is orthogonal to the order of collation weights generated by the UCA, even if normalization is assumed to be performed before computing collations (this is not a requirement; it just helps reduce the problem by making sure that canonically equivalent strings will collate the same). Many posters above seem to be completely mixing up these problems!
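To make the separation concrete, here is a small sketch in Python: the standard unicodedata module covers the UNA, while the UCA is illustrated with the third-party pyuca package (an assumption; it is not part of the standard library and must be installed separately). Normalizing first guarantees that canonically equivalent strings collate the same:

    import unicodedata
    from pyuca import Collator  # third-party UCA implementation (assumed installed)

    a = u"\u05D1\u05BC\u05B7"   # bet + dagesh + patah
    b = u"\u05D1\u05B7\u05BC"   # bet + patah + dagesh

    # UNA: both orders normalize to the same string...
    assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    # ...so after normalization the UCA trivially gives them equal sort keys.
    c = Collator()
    assert (c.sort_key(unicodedata.normalize("NFD", a))
            == c.sort_key(unicodedata.normalize("NFD", b)))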
Note: for Thai, Lao, and Tai Viet, normalization does not reorder the prepended vowels (neither does the Bidi algorithm). But such reordering is *required* when implementing the UCA, and it takes the form of contractions and expansions, which are present in the DUCET for these scripts.
Final note: it is highly recommended NOT to save texts with an implicit normalization, even if normalization is implemented correctly. There are known defects: yes, bugs in the renderers of browsers, which frequently do not implement normalization and are not able to sort, combine, and position the diacritics correctly if they are not in a specific order, which is not the same as the normalized order.

There are also problems caused by incorrect assumptions made by writers (who have not understood when and where to insert CGJ to prevent the normalization from reordering some pairs of diacritics), and who have therefore written their texts in such a way that they "seem" to render correctly, but only on a buggy browser that does not perform normalization correctly and/or has strong limitations in its text renderer (unable to recognize strings that are canonically equivalent, because it expects only one order for successive diacritics in order to position them correctly).

This type of defect is typical of the "bug" described above about the normalized order of the DAGESH (a central point in the middle of a consonant letter, modifying it) or the SIN/SHIN DOTS (above the letter, on the left or right, also modifying the consonant) relative to the other Hebrew vowel diacritics. Yes, the normalization reorders the vowel diacritics before the diacritics that modify the consonant (this is the effect of an old assignment of their relative combining classes, in a completely illogical order of values, but this will NEVER be changed, as it would affect the normalizations). But many renderers are not able to display correctly the strings that are encoded in normalized order (base consonant, vowel diacritic, sin dot or shin dot or dagesh). Instead they expect the string to be encoded as (base consonant, dagesh or sin dot or shin dot, vowel diacritic), even though this is completely canonically equivalent to the former and should display exactly the same! (Such rendering bugs were found in old versions of Windows with IE6 or before.)

For this reason, you should not, on MediaWiki, apply any implicit renormalization of edited text. If someone enters (base consonant, dagesh or sin dot or shin dot, vowel diacritic) in the wiki text, keep it unchanged; do not normalize it, as it will then display correctly both on the old buggy renderers and on newer ones.
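To illustrate the CGJ mechanism mentioned above: COMBINING GRAPHEME JOINER (U+034F) has combining class 0, so placing it between two diacritics blocks canonical reordering across it while remaining invisible in rendering. A sketch, reusing the dagesh+patah pair from the earlier comments:

    import unicodedata

    bet, dagesh, patah, cgj = u"\u05D1", u"\u05BC", u"\u05B7", u"\u034F"
    plain    = bet + dagesh + patah
    with_cgj = bet + dagesh + cgj + patah

    print(unicodedata.normalize("NFC", plain) == bet + patah + dagesh)  # True: reordered
    print(unicodedata.normalize("NFC", with_cgj) == with_cgj)           # True: order kept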
All my remarks in the previous message also apply to the Arabic diacritics. For example, the assumptions made by Brion Vibber in comment #23 are completely wrong. He has not understood what normalization is, and the fact that, with conforming renderers, normalization *must not* affect the rendering (where it does, this is due to bugs in the renderers, not bugs in the normalizer used by MediaWiki).
*** Bug 31183 has been marked as a duplicate of this bug. ***
This should probably be reassigned to one of our localization engineers.
Reassigned to Amir, as he is one of the localization engineers. This bug is still present, as can be seen at: https://en.wikisource.org/wiki/User:Amire80/Havrakha
For an extremely clear description of the problem in Hebrew, see here (pp. 8 ff.): http://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf
Amir: Do you (or the L10N team) plan to take a look at this at some point? This ticket is in 14th place on the list of open tickets with the most votes...