Last modified: 2007-02-18 00:45:50 UTC
In the English Wikipedia article on Ruhollah Khomeini, having his name in the native Arabic script (right-to-left) inside the normal English (left-to-right) text of the article causes incorrect interleaving. I am using Mozilla Firefox 2.0 on Fedora Core 5 Linux, and in my browser the first two lines of the text look something like this: "Grand Ayatollah Seyyed Ruhollah Mosavi Khomeini (listen (Persian pronunciation) (help·info)) (Persian: [Arabic text] [Arabic text] Rūḥollāh Mūsavī Khomeynī Arabic: 17) ([Arabic text] May 1900¹ - 3 June 1989) was a..." I've placed "[Arabic text]" wherever Arabic text is displayed, so that this bug report does not itself depend on the browser's settings but illustrates the issue as I see it. The problem is plainly visible: the text is supposed to say that Khomeini was born on 17 May 1900, but part of his Arabic name appears between the day "17" and the month "May 1900". In the wiki markup source everything looks OK: the Arabic text is correctly interleaved with the Western text. Is this a bug in MediaWiki or in my browser?
Created attachment 3238: a detail screenshot of the rendered article text illustrating the problem.
This can be fixed by surrounding the text with &rlm; (right-to-left mark) and &lrm; (left-to-right mark), for example &rlm;Rūḥollāh Mūsavī Khomeynī&lrm;. I suppose this could be templated as {{rtl|Rūḥollāh Mūsavī Khomeynī}}, if that template doesn't already exist. See the previous bug 8996 about similar behaviour on special pages. I think a server-side fix would be applicable to all instances of the direction override problem.
*** This bug has been marked as a duplicate of 8996 ***
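A minimal sketch of the wrapping suggested above, in Python. It assumes the two invisible characters meant in that comment are the right-to-left mark (U+200F) and the left-to-right mark (U+200E); the helper name `wrap_rtl` is hypothetical and is not part of MediaWiki.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK, HTML entity &rlm;
LRM = "\u200e"  # LEFT-TO-RIGHT MARK, HTML entity &lrm;


def wrap_rtl(text):
    """Hypothetical helper: surround text with RLM ... LRM, as suggested above."""
    return RLM + text + LRM


wrapped = wrap_rtl("Rūḥollāh Mūsavī Khomeynī")

# The marks are invisible, so show them by code point and Unicode name.
for ch in (wrapped[0], wrapped[-1]):
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
```

Whether this particular pair of marks actually cures the misplaced "17" depends on what surrounds the wrapped text, so treat it as an illustration of the mechanism rather than a verified fix.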
Bug 8996 is about a completely different problem, involving a different kind of direction mark. It's not a duplicate.
It's impossible to get correct directionality information from plain Unicode text. Consider: The Hebrew letter "aleph" is א, ב is "bet". Note that aleph is א and bet is ב, and that the logical order (as I typed it and as it was encoded) has the א before the ב. The comma and space fall between two RTL characters, so they're treated as part of an RTL run embedded in the LTR text, and the two letters are displayed in reversed order. But semantically, the comma is part of the LTR sentence (it delimits two LTR phrases, which happen to end or begin with RTL characters) and should be treated as LTR text. Now consider this, which is syntactically identical: Exodus 1:2 reads, in the original Hebrew: "ראובן, שמעון, לוי, ויהודה". Here the behavior is correct, because in this context the commas delimit RTL phrases (or words), not LTR phrases. But there's no possible way either MediaWiki or the browser could know that. The Unicode bidirectional algorithm tries to do the impossible, and consequently fails. The only way to avoid this problem is to add semantic information on how you want the directionality to go, using Unicode directionality marks, for example a left-to-right mark (U+200E) placed right after the א: The Hebrew letter "aleph" is א‎, ב is "bet".
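To make the aleph/bet example above a little more concrete, here is a small Python sketch that prints the Unicode bidirectional class of each character. It does not run the bidirectional algorithm itself; it only shows the inputs the algorithm works from. Placing the left-to-right mark directly after the aleph is my assumption about where the mark belongs.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, strongly left-to-right


def bidi_classes(text):
    """Return the Unicode bidirectional class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


ambiguous = "\u05d0, \u05d1"          # aleph, comma, space, bet
marked = "\u05d0" + LRM + ", \u05d1"  # same, with an LRM after the aleph

print(bidi_classes(ambiguous))  # ['R', 'CS', 'WS', 'R']
print(bidi_classes(marked))     # ['R', 'L', 'CS', 'WS', 'R']
```

The comma (CS) and space (WS) have no strong direction of their own. In the first string they sit between two strong R characters, so they are resolved as right-to-left and the whole span becomes one RTL run. In the second string the LRM supplies a strong L on one side, so the comma and space fall back to the direction of the surrounding left-to-right text.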