Last modified: 2014-11-17 10:16:58 UTC
I can't find an existing bug report, but there is discussion of the Hebrew case here: http://en.wikipedia.org/wiki/Wikipedia:Niqqud The problem seems to have been known for some time. We are now noticing a similar problem with Arabic on Wiktionary. There is some discussion here: http://en.wiktionary.org/wiki/Talk:%D8%AC%D8%AF%D8%A7
The bug, as I noticed it, is caused by the special characters used for vowels, dagesh, right and left shin dots, etc. not being sorted properly by the wiki software, probably because they are not being recognized as RTL. Many free texts in Hebrew are quite ancient and depend on niqqud to be read properly, so fixing this bug should take high priority, IMHO.
Input text is checked for valid UTF-8 and normalized to Unicode Normalization Form C (canonical composed form). Someone needs to provide:
* Short and exact before-and-after examples
* If possible, a comparison against other Unicode normalization implementations, to show whether we're performing normalization incorrectly
If there is an error in my normalization implementation, and it can be narrowed down, I'd be happy to fix it. If this is the result of the correct normalization algorithm, I'm not really sure what to do.
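For reference, the step being described can be sketched in a few lines of Python using the standard unicodedata module. This is only an illustration under stated assumptions: the function name clean_input is hypothetical, and the real MediaWiki code is PHP, not this.

    import unicodedata

    def clean_input(raw_bytes):
        # Raises UnicodeDecodeError if the input is not valid UTF-8.
        text = raw_bytes.decode("utf-8")
        # Normalization Form C: canonical decomposition, canonical
        # ordering of combining marks, then canonical recomposition.
        return unicodedata.normalize("NFC", text)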
For a typical before-and-after example, see the following comparison of versions: http://he.wikisource.org/w/index.php?title=%D7%90%D7%92%D7%A8%D7%AA_%D7%94%D7%A8%D7%9E%D7%91%22%D7%9F&diff=2794&oldid=1503 In that example, the only change actually made by the user was adding a category at the end, but when the text was saved, the order of vowels was altered in most of the words in the text. If what Brion means is an example of a single word or something like that, it will be hard to provide examples, because only texts contributed up to December show "before" examples. However, maybe this will help: when vowelized text from word processors like Word and OpenOffice is pasted into wiki edit boxes, the vowels are automatically changed to the wrong positions in the wiki coding.
Dovi, what browser are you using, and which version of it? Which operating system? Looking at the diff that you provided and checking the first few lines, those look OK to me. All the letters are identical on the right and on the left.
Comparing with Brion's laptop (he uses Mac OS 10.4, I use 10.3.9), the letters differ between his machine and mine. There are dots in some of Brion's letters where I don't see any.
(I was testing in Safari and JeLuF in Firefox. They may render differently, or we may have been using different fonts...) Yes, I would very much like to get individual words. You can copy them out of the Wikipedia pages if you like. Very helpful for each of these would be:
* The 'before' formatting, saved in a UTF-8 text file (Notepad on Windows XP is OK for this)
* The 'after' formatting, saved in a UTF-8 text file
* A detailed, close-up rendering of what it's supposed to look like (a screenshot of 'before' correctly rendered, at a large enough font size that I can tell the difference)
* A detailed, close-up rendering of what it ends up looking like
If possible, also a description of which bits have moved or changed and how this affects the reading of the text.
Created attachment 751 [details] A .txt file in UTF-8
I’m using IE 6 on Win2K Professional, and I’ve been seeing this problem as well. Texts that I created a year or so ago in Arabic are fine, but if I now open and re-save them (using all of the same software as before), Arabic vowel pairs become reversed. I can provide you here with some examples, one with the vowels together, and another separating the vowels with a tashdid (baseline) ... then you can remove the tashdid and bring the vowels together to see what happens. (Tahoma would be a good font to see this.)
1. This pair is supposed to look like a little superscript w with an '''over'''line: سسّـَس سسَّس (if you get an '''under'''lined w, it’s reversed).
2. This pair is supposed to look like a little superscript w with an '''under'''line: سسّـِس سسِّس (if the underline is below the entire '''word''' rather than below the little '''w''', it’s reversed).
3. This pair is supposed to look like a little superscript w with a '''double over'''line: سسّـًا سسًّا (if you get a w with a double '''under'''line, it’s reversed).
4. This pair is supposed to look like a little superscript w with a '''double under'''line: سسّـٍا سسٍّا (if the double underline is below the entire word rather than below the little w, it’s reversed).
5. This pair is supposed to look like a little superscript w with a comma above it: سسّـُس سسُّس (if the comma is '''in''' the w rather than above it, it’s reversed).
6. This pair is supposed to look like a little superscript w with a '''fancy''' comma above it: سسّـٌا سسٌّا (if the fancy comma is '''in''' the w rather than above it, it’s reversed).
As I am looking at this note '''before''' I save it, everything on my screen appears correct. After I save it, all six examples will be reversed. You can insert spaces in the examples to separate the vowels, and you should find that they have become the reverse of the control examples with tashdids (baselines) in them.
I just now sent the above message (comment #8) concerning Arabic vowel pairs, and I see that all of the vowel pairs are correct. Clearly, the Bugzilla software is different from the en.wiktionary.org software. If you copy my examples from the above message into a Wiktionary page, you will see how they become reversed.
Here's the given string broken into groups of base and combining characters:

d7 91  U+05D1 HEBREW LETTER BET
d6 bc  U+05BC HEBREW POINT DAGESH OR MAPIQ  < in normalized string, this
d6 b7  U+05B7 HEBREW POINT PATAH            < sequence is swapped
d7 99  U+05D9 HEBREW LETTER YOD
d6 b0  U+05B0 HEBREW POINT SHEVA
d7 91  U+05D1 HEBREW LETTER BET
d6 bc  U+05BC HEBREW POINT DAGESH OR MAPIQ  < in normalized string, this
d6 b7  U+05B7 HEBREW POINT PATAH            < sequence is swapped
d7 a8  U+05E8 HEBREW LETTER RESH
d6 b0  U+05B0 HEBREW POINT SHEVA
d7 a1  U+05E1 HEBREW LETTER SAMEKH

The only change in the normalized string is that each dagesh+patah combining sequence is re-ordered into patah+dagesh.

I've tried displaying the before and after texts in Internet Explorer 6.0 (Windows XP), in Firefox Deer Park Alpha 2 (Mac OS X 10.4.2), and in Safari 2.0 (Mac OS X 10.4.2). The two strings appear the same, even zoomed in, on IE/Win and Firefox/Mac. In Safari the dots are positioned slightly differently. I do not know whether this slight difference is relevant or 'real'.

Python program to confirm that another implementation gives the same results:

    from unicodedata import normalize
    before = u"\u05d1\u05bc\u05b7\u05d9\u05b0\u05d1\u05bc\u05b7\u05e8\u05b0\u05e1"
    after = u"\u05d1\u05b7\u05bc\u05d9\u05b0\u05d1\u05b7\u05bc\u05e8\u05b0\u05e1"
    coded = normalize("NFC", before)
    if (coded == before) or (coded != after):
        print "something is broken"
    else:
        print "as expected"
Created attachment 754 [details] Strings from attachment 1 [details] displaying identically in IE 6.0 on Windows XP Professional SP2
Created attachment 755 [details] Highlighted display difference in Safari on Mac OS X 10.4.2 The dots show slightly displaced in Safari 2.0 on Mac OS X 10.4.2 in the normalized text. Is that movement (from the black dot location to the red dot location) significant? They *do not* display differently in Firefox Deer Park Alpha 2 on the same machine. Both string forms display identically on that browser and OS. They *do not* display differently in Internet Explorer 6.0 on Windows XP Professional SP2. Both string forms display identically on that browser and OS.
The problem is only (I think) on Windows 98 and Windows XP prior to SP2.
I’ve been requesting a fix for the incorrect Arabic normalization (compound vowels) for months, but Arabic still cannot be entered and saved properly in en.wiktionary articles, and I have never received a reply to my requests. I don’t know if I haven’t made myself clear, if no one has had the time, or if no one thinks I know what I’m talking about. I use Firefox 1.0.7 and also IE 6 on Win2K Pro. It makes no difference which browser I use: I cannot save Arabic files correctly in en.wiktionary... nor can anyone else, apparently, because whenever somebody opens an old Arabic article to make some small change, the vowels become incorrectly reversed upon saving. I’ve been typesetting Arabic professionally since the 1970s and I know how it’s supposed to be written. If you need examples, either here or on en.wiktionary, I can easily provide them. In short, the current normalization produces the wrong results with all compound vowels: shadda+fatha, shadda+kasra, shadda+damma, and shadda+fathatan, shadda+kasratan, shadda+dammatan. In the following examples, (A) = correct and (X) = wrong:
(A) عصَّا ; (X) عصَّا
(A) عصِّا ; (X) عصِّا
(A) عصُّا ; (X) عصُّا
(A) عصًّا ; (X) عصًّا
(A) عصٍّا ; (X) عصٍّا
(A) عصٌّا ; (X) عصٌّا
Under the current normalization, if anyone opens a page containing (A), it will become (X) when he saves it (even if he makes no changes). One example is http://en.wiktionary.org/wiki/حسن , which was written with all the correct vowels prior to the implementation of normalization (and which appeared correctly), but has since had to have some of its vowels removed because of this serious problem. I will be happy to explain further if anyone needs clarification.
What I need is a demonstration of incorrect normalization. This is a Unicode standard and, as far as I have been able to test, everything is running according to the standard. Pretty much every current XML-based recommendation, file-format standard, and protocol these days recommends the use of Unicode Normalization Form C, which is what we're using. If this breaks Arabic and Hebrew, then a lot of other software is going to break it in the same way.

If there's a difference in rendering, is it:
* A bug in the renderer?
* An operating system bug? (old versions of Windows)
* An application bug? (browser etc.)
* A bug in the normalization implementation?
* A bug in the normalization rules that Unicode defines?
* A bug in the Unicode data files?
* A corrupt copy of the Unicode data files?

The impression I've been given is that it's a bug in old versions of Windows and that things render correctly on Windows XP. Can you confirm or refute this?

Can you make a clear, supportable claim that a particular normalized character sequence is incorrectly formed? If so, how should it be formed? Is the correct formation normalized or not? If not, why not? If so, why isn't it what we get from normalizing the input? Is there an automatic transformation we can do on output? If so, what? If there is, should we do so? What are the complications that can arise?

Or perhaps the error is in the arrangement of the original input? Where does the input come from and what arranges it? Is it arranged correctly? If not, how should it be arranged? How can it be arranged? Is there an automatic transformation we can do on input? If so, what? If there is, should we do so? What are the complications that can arise?

On these questions I've gotten a lot of nothing. The closest has been an example of a string in 'before' and 'after' states, which appears to render identically on Windows... so what's the problem?
I can confirm that the bug has been fixed for Hebrew in Service Pack 2 of Windows XP, but not in earlier versions. If this is the case for Arabic as well, which our Arabic-reading members can check, then we should probably add to the main he.wiki pages, and the equivalent Arabic ones, an explanation of the problem with a recommendation to upgrade to that OS and service pack.
Created attachment 978 [details] Correct rendering of the string "Bibi" with fixed-width font
Created attachment 979 [details] Incorrect rendering of the string "Bibi" with fixed-width font Screenshot taken in the wiki editor box after pressing 'Show preview'.
If indeed the Unicode normalization rules imply the switching of the DAGESH and the PATAH (as demonstrated in comment #10), then I suppose it's a bug in the renderer. As for the way things _should_ be: it is completely insignificant for a user which way the symbols are stored. In Hebrew (manual) writing it makes no difference whether the DAGESH is written down before the PATAH or vice versa. When typing text on a computer (at least on Windows), the text is displayed and stored correctly only if the DAGESH is entered first. I don't have the tools here to examine the way it is stored internally, but it is nevertheless rendered correctly every time. This is not the case on the wiki. Once the procedure switches the two symbols, the DAGESH is displayed _outside_ of the BET, an obvious misrendering (see attachments id=978, id=979). I have experienced this bug on Windows 2000 as well as Windows XP with IE 6.0.x. I believe this should be considered a significant bug, as these are highly popular environments. Moreover, in Hebrew (and Arabic), vowel marks are used mostly in scripture, poetry, and transliteration of foreign words and names. Many wiki pages (especially on Wikisource) contain such texts. The bug renders such text hard to read and is _very_ apparent to any user who tries to read these texts (and very annoying for me, as I am currently writing about China and constantly need to transliterate Chinese names).
(In reply to comment #19, by Ariel Steiner) Ariel, did you experience this bug in Win XP with Service Pack 2? I use that, and I see Hebrew with nikkud on the wiki perfectly. Others have reported this bug to exist in Win XP with SP1 but not with SP2, so I assume it has been fixed in the latter service pack.
I experienced the bug on both WinXP (no SP2) and Win2K, both with IE6 and Firefox 1.0.7. I don't see why a user should have to upgrade from Win2K (or Me) to WinXP SP2 just because of a nikkud problem.
I'd like to add to Ariel's comments that nikkud works perfectly well in various fonts and on all platforms in word processors: Word for Windows and OpenOffice. Why should MediaWiki be any different? Don't the word processors also use Unicode? Dovi
Dovi, typical word processors probably aren't applying canonical normalization to text.

OK, I spent some time googling around trying to find more background on this. Basically there seem to be two distinct issues:

1) The normalization rules order some nikkud combinations differently from what the font renderer in old versions of Windows expects. This is a bug in either Windows or the font. From all indications that have been given to me, this is fixed in the current version of Windows (XP Service Pack 2).

2) In some rarer cases, appearing in at least Biblical Hebrew, actual semantic information may be lost by the application of normalization. This is a bug in the Unicode standard, but it's already established. Some day they may figure out a proper workaround.

As for 1), my inclination is to recommend that you upgrade if it's bothering you. Turning off normalization in general would open us up to various weird data corruption, confusing hard-to-reach duplicate pages, easier malicious name spoofing, etc. If Microsoft has already fixed the bug in their product, great. Use the fixed version or try a competing OS. It might be possible to add a postprocessing step to re-order output to what old buggy versions of Windows expect, but this sounds error-prone.

As for 2), it's not clear to me whether this is just a phantom problem that _might_ break something or whether it's actually breaking text. (Most stuff is probably affected by problem 1.) There's not much we can do about this if it happens, other than turning off normalization (and all that entails).

Background links:
http://www.unicode.org/faq/normalization.html#8
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html
http://lists.ibiblio.org/pipermail/biblical-languages/2003-July/000763.html
Does anybody know if the Windows bugs were in the fonts, in Uniscribe, or in both? Can the new Uniscribe handle the old fonts, for instance? If all or part of the problem was with the fonts, then what about third-party fonts not under Microsoft's control? Also, has Microsoft issued any kind of fix for OSes other than XP? Has anybody tested this on any Unix or Linux platform? How does Pango handle this? Without knowing the answers to all these questions, I would lean toward a user option to perform a post-normalization compatibility re-ordering.
Hello! [[en:Wikipedia_talk:Niqqud#Precombined_characters_-_NON-precombined_characters]] relays some notes received from http://mysite.verizon.net/jialpert/YidText/YiddishOnWeb.htm : "Recommendations for Displaying Yiddish Text on Web Pages". Depending on platform, browser, characters (and fonts?), one may experience some of the problems mentioned there. That page suggests using "precombined characters" as the preferred output and "postponing" non-precombined characters until later days. Consequence: Wikimedia projects should provide at least some notes about the problem (affected platforms / browsers / what to do / how to configure / upgrade to ...). Regards, Reinhardt [[user:gangleri]]
Please see also bug 3885: title normalisation
I've tried to find what causes the problem, and I've located it. The problem is in UtfNormal::fastCombiningSort, in the file phase3/includes/normal/UtfNormal.php. It sorts the nikud according to the numbers in $utfCombiningClass (defined in phase3/includes/normal/UtfNormalData.inc). This array, unserialized, is shown in [[he:Project:ניקוד#איפיון הבאג]], in the <pre>. You can see that Dagesh is 21 and Patah is 17, so they are re-ordered: instead of Dagesh+Patah, we get Patah+Dagesh. But the order SHOULD be first Dagesh, then Patah, because that's their correct order; so it's a bug in MediaWiki that we re-order them. In WinXP SP2 they are shown correctly because of a *workaround* (it's not a bugfix there, only a workaround for these mistakes), but their stored order is still wrong. Maybe in Vista they won't use this workaround.

The question is: what does this function (UtfNormal::fastCombiningSort) do? What's its purpose? Why should it sort the nikud, or anything else? It's already sorted well. How is it related to the normalization? Is there any documentation about it?

You can just delete the nikud from the array $utfCombiningClass if you want to see the function leave them alone.

I am changing the summary, because that's exactly the bug. I am also changing the OS and Hardware fields, because the bug is not only there: the final display problem appears there, but the underlying problem exists everywhere. Thank you very much, and please answer the questions in the second paragraph, so that we will be able to fix this bug.
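The combining-class values quoted above can be verified against an independent copy of the Unicode data, for example with Python's unicodedata module (a quick cross-check, not MediaWiki code):

    import unicodedata

    dagesh, patah = u"\u05BC", u"\u05B7"
    print(unicodedata.combining(dagesh))   # 21 (HEBREW POINT DAGESH OR MAPIQ)
    print(unicodedata.combining(patah))    # 17 (HEBREW POINT PATAH)
    # Canonical ordering sorts adjacent marks by combining class,
    # so dagesh+patah becomes patah+dagesh under NFC:
    bet = u"\u05D1"
    print(unicodedata.normalize("NFC", bet + dagesh + patah)
          == bet + patah + dagesh)         # True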
(In reply to comment #27)
> This array, unserialized, is shown in [[he:Project:ניקוד#איפיון הבאג]],
> in the <pre>.

Now it's shown in [[User:Rotemliss/Nikud]].
Rotem, this function implements a Unicode standard. The bug is in the standard. Until some future version of Unicode "fixes" this, I'm just going to mark this bug as LATER.
I've slapped up some notes at http://www.mediawiki.org/wiki/Unicode_normalization_considerations
I for one totally support the suggested solution, namely "Remove the normalization check" etc. That would be ideal for the Hebrew Wikipedia, since its guidelines strictly forbid the use of nikkud (vowel markers) in its titles, i.e., there are no composed letters in document titles. Separating the title and display title would also be very convenient, because it would allow easy searching on one hand and the use of nikkud in the display title where appropriate on the other.
Incidentally, this is not a "bug" in the Unicode Standard, and won't be fixed later in that standard. The entire issue of canonical ordering of "fixed position" class combining marks for Hebrew has been debated extensively on the Unicode forums, but the outcome isn't about to change, because of the requirements for stability of normalization.

The problem is in people's interpretation of the *intent* of canonical ordering in the Unicode Standard. (See The Unicode Standard, 5.0, p. 115.) "The canonical order of character sequences does *not* imply any kind of linguistic correctness or linguistic preference for ordering of combining marks in sequences." In effect, the Unicode Standard is agnostic about the input order or linguistically preferred order of dagesh+patah (or patah+dagesh). What normalization (and canonical ordering) *do* imply, however, is that the two sequences are to be interpreted as equivalent.

It sounds to me like MediaWiki is implementing Unicode normalization correctly. The bug, if anything, is in the *rendering* of the sequences, as implied by some of the earlier comments on this. dagesh+patah and patah+dagesh should render identically; there is no intent that they stack in some different way dependent on their ordering when rendered. The original intent of the fixed position combining classes in the standard was that they applied to combining marks whose *positions were fixed*: in other words, the dagesh goes where the dagesh is supposed to go, and the patah goes where the patah is supposed to go, regardless of which order they were entered or stored.

Also, it should be noted that the Unicode Standard does not impose any requirement that Unicode text be stored in normalized form. Wikimedia is free to normalize or not, depending on its needs and contexts. Normalization to NFC in most contexts is probably a good idea, however, as it simplifies comparisons, sorts, and searches. But as in this particular case for Hebrew, you can run into issues in the display of normalized text if your rendering system and/or fonts are not quite up to snuff regarding the placement of sequences of marks for pointed Hebrew text.

--Ken Whistler, Unicode 5.0 editor
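The equivalence point above is easy to demonstrate with any conforming implementation. For example, in Python (an illustration only), both input orders normalize to the identical string, which is why a conforming renderer must draw them identically:

    import unicodedata

    bet, dagesh, patah = u"\u05D1", u"\u05BC", u"\u05B7"
    a = bet + dagesh + patah   # dagesh entered first
    b = bet + patah + dagesh   # patah entered first
    # Canonically equivalent: both normalize to the same NFC string.
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True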
Hebrew vowelization seems much improved in Firefox 3. It would be nice to know exactly what changed and how, and to have these things documented in case there are future problems. Firefox 3 seems to correctly represent the vowel order for webpages in general and Wikimedia pages in particular. The only anomaly I found nevertheless is that pasting vowelized text into the edit page shows only partial vowelization. On the "saved" wiki page it appears correctly.
(In reply to comment #33)
> Hebrew vowelization seems much improved in Firefox 3. It would be nice to
> know exactly what changed and how, and to have these things documented in
> case there are future problems.
>
> Firefox 3 seems to correctly represent the vowel order for webpages in
> general and Wikimedia pages in particular.
>
> The only anomaly I found nevertheless is that pasting vowelized text into
> the edit page shows only partial vowelization. On the "saved" wiki page it
> appears correctly.

The bug of showing the dagesh and other vowels in the wrong order usually depends on the operating system. For example, Windows XP (possibly only with Service Pack 2) displays it well, while older Windows systems don't. However, Firefox 3.0 did fix some Hebrew vowel bugs, like the problem with nikud in justified text (see https://bugzilla.mozilla.org/show_bug.cgi?id=60546 ).
*** Bug 14834 has been marked as a duplicate of this bug. ***
Since this bug also affects Myanmar in exactly the same way, could the title be appended with Myanmar as well? Normalization is not taking place the way it should. Here is the sort sequence as it should be, as specified in Unicode Technical Note #11:

Name            Specification
Consonant       [U+1000 .. U+102A, U+103F, U+104E]
Asat3           U+103A
Stacked         U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y        U+103B
Medial R        U+103C
Medial W        U+103D
Medial H        U+103E
E vowel         U+1031
Upper Vowel     [U+102D, U+102E, U+1032]
Lower Vowel     [U+102F, U+1030]
A Vowel         [U+102B, U+102C]
Anusvara        U+1036
Visible virama  U+103A
Lower Dot       U+1037
Visarga         U+1038

I can provide more technical detail if needed. Hence U+1037 should always come after U+103A (even though U+103A is 'higher'). And U+1032 should come _before_ U+102F, U+1030, U+102B, U+102C, and so on. I notice that this bug is related more to Unicode normalization than to MediaWiki itself. But an important question I have is: *can* the Unicode normalization check be disabled for the Myanmar Wikipedia while we try to resolve it? That would be very helpful.
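The conflict between the UTN#11 sequence and canonical ordering can be reproduced with any conforming normalizer, for example in Python (a small check; KA, U+1000, stands in for an arbitrary consonant):

    import unicodedata

    # UTN#11 order: visible virama (asat, U+103A) before lower dot (U+1037)
    s = u"\u1000\u103A\u1037"
    print(u" ".join(u"U+%04X" % ord(c) for c in unicodedata.normalize("NFC", s)))
    # -> U+1000 U+1037 U+103A: canonical ordering (ccc 7 < ccc 9) puts the
    #    lower dot first, the opposite of the UTN#11 sequence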
(In reply to comment #36)
> Since this bug also affects Myanmar in exactly the same way, could the
> title be appended with Myanmar as well?

You can do things like that yourself here.

> But an important question I have is: *can* the Unicode normalization check
> be disabled for the Myanmar Wikipedia while we try to resolve it?

See [[mw:Unicode normalization concerns]]. This is feasible. We could turn off normalization for article text and leave it for titles, which would allow DISPLAYTITLE to be used to work around ugly display in titles. However, it would require some work.
I would prefer to keep normalization, as there are benefits from it: it enforces a particular sequence. My question now is what kind of data I should provide to Brion Vibber so that he can implement the normalization for Myanmar. Our case is quite different from Hebrew and is more straightforward. I believe UTN#11 v2 would be sufficient? It was updated recently for Unicode 5.1. I would like to wait a while before actually thinking of disabling normalization for article text and using the workaround for titles. If it can be implemented, we won't need to turn off normalization, and we would benefit from it. Thanks.
It would almost certainly be a bad idea to use different normalization for a single wiki. This would create complications when trying to, for instance, import pages. If this is genuinely an issue for Myanmar, we should fix it in the core software for all MediaWiki wikis that contain any Myanmar text. Same for Hebrew and Arabic. What exactly is the issue here? Some user agents render theoretically equivalent sequences of code points differently, so normalization changes display? Which user agents are these?
Created attachment 5078 [details] Relative Order (Normalization?) for Unicode 5.1 Myanmar
Created attachment 5079 [details] Relative Order (Normalization?) for pre-Unicode 5.1/Myanmar
I have attached two images. The first one shows the normalization sequence for 5.1, and the second one shows the normalization sequence for pre-Unicode 5.1. They are drastically different. A copy of both can be found here: http://unicode.org/notes/tn11/myanmar_uni-v2.pdf (page 4 for the latest, page 9 for the deprecated one). The normalization done by MediaWiki seems to be the pre-5.1 one. I am adding the pre-5.1 table here:

Name            Specification
kinzi           U+1004 U+1039
Consonant       [U+1000 .. U+102A]
Stacked         U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y        U+1039 U+101A
Medial R        U+1039 U+101B
Medial W        U+1039 U+101D
Medial H        U+1039 U+101F
E vowel         U+1031
Lower Vowel     [U+102F, U+1030]
Upper Vowel     [U+102D, U+102E, U+1032]
A Vowel         U+102C
Anusvara        U+1036
Visible virama  U+1039 U+200C
Lower Dot       U+1037
Visarga         U+1038

Yes, normalization changes display. I have attached a JPEG file showing the error caused, here: https://bugzilla.wikimedia.org/show_bug.cgi?id=14834
Created attachment 5080 [details] Contents of includes/normal/UtfNormalData.inc As far as I can tell, MediaWiki is indeed using the 5.1 tables. I've attached the data used for normalization, which is generated by a script that downloads the appropriate files from http://www.unicode.org/Public/5.1.0/ucd/. If you can spot an error, please say what it is. You might want to talk to Tim Starling, since as far as I can tell he's the one who wrote this.
U+1037 is int(7) and U+103A is int(9); does this mean that U+1037 will always be put first? This seems so similar to the patah-dagesh issue. :( This is the relevant section of $utfCombiningClass:

["့"]=> int(7)
["္"]=> int(9)
["်"]=> int(9)

The order given here does not seem to be the same as the order given in UTN#11. I guess this is a lesson not to take UTNs too seriously. I do like the sort order as it is on Wikipedia; it's just having problems with fonts. And I am a bit surprised that the data in the UCD does not match what was authored in the UTN. So as far as MediaWiki is concerned, it's just like the situation with Hebrew. We will now need to move over to the Unicode mailing list and ask what's going on. Simetrical, many thanks for clearing this one up for me. :) As a side note, the developer of the Parabaik font gave me this link: http://ngwestar.googlepages.com/padaukvsmyanmar3 I noticed that the sequence mentioned there was recently changed.
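The same values can be confirmed against Python's copy of the Unicode character database, independently of MediaWiki's PHP data files:

    import unicodedata

    for cp in (u"\u1037", u"\u1039", u"\u103A"):
        print("U+%04X ccc=%d" % (ord(cp), unicodedata.combining(cp)))
    # U+1037 ccc=7, U+1039 ccc=9, U+103A ccc=9,
    # matching the $utfCombiningClass entries above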
Found something which should not have been re-sequenced.

Input:  U+101E U+1004 U+103A U+1039 U+1001 U+103B U+102C
Output: U+101E U+1001 U+103B U+102C U+1004 U+103A U+1039

The output is wrong because U+1004 is a consonant and U+1001 is also a consonant, hence MediaWiki should not have swapped them, if my understanding of Unicode normalization is correct. My understanding is that the sorting starts over whenever a new consonant starts, because that is the beginning of a new syllable cluster. No font will be able to render the output from MediaWiki.
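For what it's worth, an independent implementation agrees that nothing here should move: Python's unicodedata leaves this input unchanged, because U+1001 has combining class 0 and canonical reordering never moves marks across a base character. This is a cross-check only, not proof of where the bug lives; if MediaWiki produced the output above, the fault would be in its implementation rather than in the standard.

    import unicodedata

    s = u"\u101E\u1004\u103A\u1039\u1001\u103B\u102C"
    # True: a conforming normalizer does not reorder marks across
    # the base letter U+1001 (combining class 0)
    print(unicodedata.normalize("NFC", s) == s)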
I suggest you e-mail Tim Starling.
I am adding here that the issue with Myanmar Unicode (Lower Dot and Visible Virama) is an issue that will be covered in the revision to UTN#11, as an oversight in the standards review process. Due to the stability criteria of UnicodeData.txt, there is nothing we can do about this. This is not a MediaWiki bug; since many people are now linking here to point this out as a bug, I need to clarify that. This sadly does mean that fonts and IMEs will need to update; meanwhile MediaWiki 1.4 will have the problem mentioned here, and the way to resolve it is simply to wait for updated fonts and IMEs. The advantages of turning off normalization far outweigh the disadvantages. If there are plans to adopt a less invasive normalization process, as mentioned in Normalization Concerns, then the issue can be resolved. The developers of fonts and IMEs have agreed to update, so those running MediaWiki installations might want to keep normalization on. The second issue, with kinzi (comment #45), seems to be resolved now. Was MediaWiki updated between July and now?
FYI: https://bugzilla.wikimedia.org/show_activity.cgi?id=2399 I did not change priorities; I only added myself as CC. It seems that the Priority field is gone.
Marking REOPENED. The standard has been updated since 2006. We discussed this at the Berlin Hackathon.
See another demonstration of this problem here: http://en.wikisource.org/wiki/User:Amire80/Havrakha
Assigning to me so we can look over the current state and see about fixing it up.
Apparently, you have not implemented the contractions and expansions of the UCA. Note that there has been NO change in Unicode 5.1 (or later) to normalization, which has been stabilized since at least Unicode 4.0.1. The bugs above are most probably not related to normalization, if it is implemented correctly (and normalization is an easy problem that can be implemented very efficiently). And the changes in the DUCET (or now the CLDR DUCET) do not affect how Hebrew, Arabic, or Myanmar is sorted within the same script.

You should also learn to separate the Unicode Normalization Algorithm (UNA), the Unicode Collation Algorithm (UCA), and the Unicode Bidi Algorithm (UBA), because the Bidi algorithm only affects the display, but definitely NOT the other two. And the order produced by normalization is orthogonal to the order of collation weights generated by the UCA, even if normalization is assumed to be performed before computing collations (this is not a requirement; it just helps reduce the problem by making sure that canonically equivalent strings will collate the same). Many posters above seem to be completely mixing up these problems!
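To make the separation concrete, here is a small sketch in Python: the standard unicodedata module covers the UNA, while the UCA is illustrated with the third-party pyuca package (an assumption; it is not part of the standard library and must be installed separately). Normalizing first guarantees that canonically equivalent strings collate the same:

    import unicodedata
    from pyuca import Collator  # third-party UCA implementation (assumed installed)

    a = u"\u05D1\u05BC\u05B7"   # bet + dagesh + patah
    b = u"\u05D1\u05B7\u05BC"   # bet + patah + dagesh

    # UNA: both orders normalize to the same string...
    assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    # ...so after normalization the UCA trivially gives them equal sort keys.
    c = Collator()
    assert (c.sort_key(unicodedata.normalize("NFD", a))
            == c.sort_key(unicodedata.normalize("NFD", b)))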
Note: for Thai, Lao, and Tai Viet, normalization does not reorder the prepended vowels (neither does the Bidi algorithm). But such reordering is *required* when implementing the UCA, and it takes the form of contractions and expansions, which are present in the DUCET for these scripts.
Final note: it is highly recommended NOT to save texts with an implicit normalization, even if normalization is implemented correctly. There are known defects: yes, bugs in the renderers of browsers, which frequently do not implement normalization and are not able to sort, combine, and position the diacritics correctly if they are not in a specific order, which is not the same as the normalized order.

There are also problems caused by incorrect assumptions made by writers (who have not understood when and where to insert CGJ to prevent the normalization from reordering some pairs of diacritics), and who have therefore written their texts in such a way that they "seem" to render correctly, but only on a buggy browser that does not perform normalization correctly and/or has strong limitations in its text renderer (unable to recognize strings that are canonically equivalent, because it expects only one order for successive diacritics in order to position them correctly).

This type of defect is typical of the "bug" described above about the normalized order of the DAGESH (a central point in the middle of a consonant letter, modifying it) or the SIN/SHIN DOTS (above the letter, on the left or right, also modifying the consonant) relative to the other Hebrew vowel diacritics. Yes, the normalization reorders the vowel diacritics before the diacritics that modify the consonant (this is the effect of an old assignment of their relative combining classes, in a completely illogical order of values, but this will NEVER be changed, as it would affect the normalizations). But many renderers are not able to display correctly the strings that are encoded in normalized order (base consonant, vowel diacritic, sin dot or shin dot or dagesh). Instead they expect the string to be encoded as (base consonant, dagesh or sin dot or shin dot, vowel diacritic), even though this is completely canonically equivalent to the former and should display exactly the same! (Such rendering bugs were found in old versions of Windows with IE6 or before.)

For this reason, you should not, on MediaWiki, apply any implicit renormalization of edited text. If someone enters (base consonant, dagesh or sin dot or shin dot, vowel diacritic) in the wiki text, keep it unchanged; do not normalize it, as it will then display correctly both on the old buggy renderers and on newer ones.
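To illustrate the CGJ mechanism mentioned above: COMBINING GRAPHEME JOINER (U+034F) has combining class 0, so placing it between two diacritics blocks canonical reordering across it while remaining invisible in rendering. A sketch, reusing the dagesh+patah pair from the earlier comments:

    import unicodedata

    bet, dagesh, patah, cgj = u"\u05D1", u"\u05BC", u"\u05B7", u"\u034F"
    plain    = bet + dagesh + patah
    with_cgj = bet + dagesh + cgj + patah

    print(unicodedata.normalize("NFC", plain) == bet + patah + dagesh)  # True: reordered
    print(unicodedata.normalize("NFC", with_cgj) == with_cgj)           # True: order kept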
All my remarks in the previous message also apply to the Arabic diacritics. For example, the assumptions made by Brion Vibber in comment #23 are completely wrong. He has not understood what normalization is, and the fact that, with conforming renderers, normalization *must not* affect the rendering (where it does, this is due to bugs in the renderers, not bugs in the normalizer used by MediaWiki).
*** Bug 31183 has been marked as a duplicate of this bug. ***
This should probably be reassigned to one of our localization engineers.
Reassigned to Amir, as he is one of the localization engineers. This bug is still present, as can be seen at: https://en.wikisource.org/wiki/User:Amire80/Havrakha
For an extremely clear description of the problem in Hebrew, see here (pp. 8 ff.): http://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf
Amir: Do you (or the L10N team) plan to take a look at this at some point? This ticket is in 14th place on the list of open tickets with the most votes...