Last modified: 2013-08-24 19:53:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T23429, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 21429 - Arabic double diacritics presentation
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: 1.16.x
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Amir E. Aharoni
URL: http://id.wikipedia.org/wiki/Pengguna...
Keywords: i18n
Depends on:
Blocks:
Reported: 2009-11-07 15:31 UTC by Arif
Modified: 2013-08-24 19:53 UTC
CC List: 10 users

Description Arif 2009-11-07 15:31:43 UTC
In Arabic, double diacritics have their own presentation. For example, the sequence "U+0651 ARABIC SHADDA", "U+0650 ARABIC KASRA" is presented as "U+FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM". There is no such presentation in MediaWiki yet, because the sequence is swapped after saving. In the previous example, the sequence is swapped to U+0650, U+0651.
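
The swap described above is consistent with Unicode canonical reordering, which sorts adjacent combining marks by canonical combining class. Here is a minimal sketch using Python's standard `unicodedata` module. It assumes the wiki applies NFC on save and does not call MediaWiki's own normalizer; `code_points` is a helper introduced only for this illustration.

```python
import unicodedata

SHADDA = "\u0651"  # ARABIC SHADDA, canonical combining class 33
KASRA = "\u0650"   # ARABIC KASRA, canonical combining class 32

def code_points(text):
    """Format each character as U+XXXX for readable output."""
    return [f"U+{ord(ch):04X}" for ch in text]

typed = SHADDA + KASRA                       # order as typed in the edit box
saved = unicodedata.normalize("NFC", typed)  # assumed to mirror what happens on save

# Marks are reordered so that combining classes are non-decreasing.
print(code_points(typed), [unicodedata.combining(ch) for ch in typed])
print(code_points(saved), [unicodedata.combining(ch) for ch in saved])
```

Under this assumption the saved text holds KASRA before SHADDA, which is the swap reported here.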
Comment 1 Niklas Laxström 2009-11-07 19:03:09 UTC
What is the bug? All text is converted to some normalisation form.
Comment 2 Arif 2009-11-08 07:16:53 UTC
Oops, sorry. I meant in the edit box. The rendered result is fine, since both sequences are converted to the correct character. But not in the edit box. For example, I wrote: ARABIC LETTER ALIF, ARABIC LETTER LAM, ARABIC LETTER HAH, ARABIC LETTER REH, U+0651 ARABIC SHADDA, U+064F ARABIC DAMMA. In the edit box, the double diacritics are converted to U+FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM. Whenever I click "Save page" or "Show preview", the source becomes: ARABIC LETTER ALIF, ARABIC LETTER LAM, ARABIC LETTER HAH, ARABIC LETTER REH, U+064F, U+0651. This time, the U+FC61 character that I expected to see is not there.
Comment 3 Philippe Verdy 2009-11-19 19:48:24 UTC
Isn't the U+FC61 a compatibility character whose normalization excludes decomposition and recombinations under NFD/NFC canonical equivalences?

If some Arabic fonts do not support two successive diacritics as recommended by Unicode, and only support the decomposable compatibility characters, these fonts are really bogus and should be avoided. But the problem is not there, see below.

If the character is not a canonical equivalent to the two diacritics, it must not be altered (even if it's not recommended).
In other words, MediaWiki must just apply the NFC normalization, but NOT the NFKC normalisation.

When I look at the UCD, it reveals that U+FC61 decomposes as "[isolated] U+0020 U+064F U+0651"

Which means that this is just a compatibility decomposition, and not a canonical decomposition (note also that the decomposition adds an extra space, which in newer documents should rather be a non-breaking space instead of a regular space, to avoid side effects that are possible with whitespace compressions in HTML and XML). Note also that the space still prohibits reordering.
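
The difference between the two mappings can be checked with Python's standard `unicodedata` module. This is only a sketch of what NFC and NFKC do to this one character; it says nothing about which form MediaWiki actually applies, and `code_points` is a helper introduced for this illustration.

```python
import unicodedata

LIGATURE = "\uFC61"  # ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM

def code_points(text):
    """Format each character as U+XXXX for readable output."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The UCD mapping carries a tag such as <isolated>, which marks it as a
# compatibility decomposition rather than a canonical one.
mapping = unicodedata.decomposition(LIGATURE)
print(mapping)

# NFC only uses canonical mappings, so the character is left alone.
nfc = unicodedata.normalize("NFC", LIGATURE)
# NFKC also applies compatibility mappings, including the leading space.
nfkc = unicodedata.normalize("NFKC", LIGATURE)

print(code_points(nfc))
print(code_points(nfkc))
```

With NFC the ligature survives unchanged, while NFKC yields SPACE, DAMMA, SHADDA.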

I see no reason, then, why MediaWiki would choose to convert U+FC61 incorrectly to U+064F U+0651 (stripping the "[isolated]" compatibility specifier and one space).

And also no reason why it would recombine U+064F U+0651 (adding the leading space and a nonexistent [isolated] form) into U+FC61 in the editor.

The same reasoning should apply to all the other Arabic compatibility characters (with implicit letter forms), which should be avoided in actual Arabic text unless there is a strong reason to display the character in isolation with a specific form distinct from the normal Arabic presentation rules.
Comment 4 Tim Starling 2010-06-08 04:11:20 UTC
Normalisation of the Arabic presentation forms was requested by members of the Arabic Wikipedia community. I recorded the request at bug 9413 and later implemented it.
Comment 5 Philippe Verdy 2011-01-31 21:17:33 UTC
OK, but bug 9413 only spoke about the presentational forms of letters (i.e. the distinction of *letters* between initial, medial, final, and isolated). The Shadda is not a letter and may be inserted at any place within a word as a presentational feature. As it is presentational, changing it by the compatibility mapping will change precisely its presentational semantics.

If the purpose was to convey a single meaning, it should have been stripped completely. When U+FC61 appears, it is used in isolation where its expected width and appearance are important. Changing it will alter its width, and the KASRA may not fit very well.

But maybe the font renderers are now capable of handling it and generating exactly what U+FC61 displays when it is mapped in a font (but such a mapping is not necessary in any Arabic font, even if those fonts most often add those mappings).

I'm not sure this is a big issue. What is the problem if we cannot see the difference, except when editing, where you'll type BACKSPACE twice instead of once to delete it completely in insert mode (but there is no difference when you select it with the mouse)?

The only case where it could make some difference is when U+FC61 is followed by another Arabic diacritic (due to canonical reordering after the compatibility decomposition has been applied). This does not change the BiDi behavior or the joining behavior, even if there are spaces or punctuation on both sides.
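
That case can be sketched with Python's standard `unicodedata` module, taking FATHA as an arbitrary example of a following diacritic. It assumes the normalizer behaves like NFKC for this character, which this thread does not confirm; `code_points` is a helper introduced for this illustration.

```python
import unicodedata

LIGATURE = "\uFC61"  # ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FATHA = "\u064E"     # ARABIC FATHA, canonical combining class 30

def code_points(text):
    """Format each character as U+XXXX for readable output."""
    return [f"U+{ord(ch):04X}" for ch in text]

text = LIGATURE + FATHA

# NFC: the ligature has combining class 0, so nothing moves across it.
nfc = unicodedata.normalize("NFC", text)

# NFKC: the ligature first expands to SPACE, DAMMA (31), SHADDA (33);
# the trailing FATHA (30) is then reordered in front of both marks.
nfkc = unicodedata.normalize("NFKC", text)

print(code_points(nfc))
print(code_points(nfkc))
```

Here the FATHA that followed the ligature ends up ahead of the DAMMA and SHADDA after the compatibility decomposition, while under NFC the original order is kept.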

If it ever appears in the middle of a word, however, this will change its appearance, because the decomposition and the joining type will alter its form. I doubt that such cases exist in normal Arabic. This could be an issue in IDNA domain names if this compatibility character were not mandatorily mapped to the normal shadda+diacritic (just like other Arabic compatibility presentational forms), but it merits some investigation to check that this is effectively the case with the newer IDNA RFCs and Unicode papers about IDNA (which have relaxed some rules to allow more characters that were restricted before).

But if this causes any problem in a URL inserted as the target of an external link, one could still use the "xn--" notation in the hidden URL. I also have serious doubts that a URL with compatibility characters in its domain name would be harmless (most probably it would be a cybersquatting domain). Such a character could instead be valid and distinct within the query string, anchor, or path part of the URL, for example in a link to a site detailing the Unicode properties of this compatibility character; but maybe there is still a way to encode the URL specially.

Anyway, none of those Arabic compatibility characters are recommended within any part of a stable URL, and Arabic keyboards in any decent browser have not generated them for a long time. They are most probably detected by browsers or security suites as dangerous if ever found in a URL, notably if they appear in the domain name part, in some IDNA-enabled registry or private subregistry that does not implement a restriction on those characters in its DNS records. The browser or its security extension will then propose that the user follow the link with the normal characters, cancel the navigation and go back, or confirm after the warning that they really want to go there.
Comment 6 Stephen G. Brown 2013-08-24 19:53:20 UTC
This bug was first reported in Bug 2399 - Unicode normalization "sorts" Hebrew/Arabic/Myanmar vowels wrongly.
