Last modified: 2011-01-25 01:09:51 UTC
According to the Unicode FAQ:

Q. Is it necessary to use the presentation forms that are defined in Unicode?
A. No, it is not necessary to use those presentation forms. Those forms were selected and identified in the early days of developing Unicode when sophisticated rendering engines were not prevalent. A selected subset of the presentation forms was included to provide users with a simple method to generate them.

Q. Can one use the presentation forms in a data file?
A. It is strongly discouraged and not recommended because it does not guarantee data integrity and interoperability. In the particular case of Arabic, data files should include only the characters in the Arabic block, U+0600 to U+06FF.

Unidentified broken clients are inserting Arabic presentation forms into articles on ar.wikipedia.org. This causes problems because some browsers do not display these characters.

I suggest we convert presentation forms to their canonical equivalent during NFC normalisation on page save. For those rare cases where isolated characters in specified forms are required, HTML character entities can be used.
We already do NFC normalization on page save. Are you asking for additional conversions? If so, can you specify?
Yes, additional conversions. The Arabic presentation forms (FB50-FDFF and FE70-FEFF) should be converted to their equivalents in the Arabic block, 0600-06FF. The relevant mapping is given in the Decomposition_Mapping field of UnicodeData.txt. For example:

FB51;ARABIC LETTER ALEF WASLA FINAL FORM;Lo;0;AL;<final> 0671;;;;N;;;;;

Because the mapping carries the formatting tag "<final>", it is a compatibility mapping (part of NFKC) rather than a canonical mapping (part of NFC).
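To illustrate the requested conversion, here is a minimal Python sketch. It is not the code from the eventual fix, and the function names are made up for this example. It assumes the two Unicode blocks Arabic Presentation Forms-A (FB50-FDFF) and Arabic Presentation Forms-B (FE70-FEFF), reads the Decomposition_Mapping through the standard unicodedata module, replaces only characters in those blocks that have a tagged (compatibility) mapping, and then applies ordinary NFC.

```python
import unicodedata

# Assumed ranges: Arabic Presentation Forms-A and Arabic Presentation Forms-B.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))


def is_presentation_form(ch):
    """Return True if ch lies in one of the presentation-form blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES)


def fold_char(ch):
    """Map one presentation form to its Decomposition_Mapping target.

    Characters outside the two blocks, and characters in the blocks
    without a tagged mapping, are returned unchanged.
    """
    if not is_presentation_form(ch):
        return ch
    # For U+FB51 this string is "<final> 0671".
    fields = unicodedata.decomposition(ch).split()
    if not fields or not fields[0].startswith("<"):
        return ch
    # Drop the formatting tag and keep the target code points.
    # Ligatures map to more than one code point.
    return "".join(chr(int(cp, 16)) for cp in fields[1:])


def fold_arabic_presentation_forms(text):
    """Fold presentation forms, then apply NFC to the result."""
    folded = "".join(fold_char(ch) for ch in text)
    return unicodedata.normalize("NFC", folded)


if __name__ == "__main__":
    sample = "\ufb51"  # ARABIC LETTER ALEF WASLA FINAL FORM
    # NFC alone leaves the presentation form in place.
    print(unicodedata.normalize("NFC", sample) == sample)
    # The fold replaces it with U+0671 from the Arabic block.
    print(fold_arabic_presentation_forms(sample) == "\u0671")
```

Restricting the fold to the two blocks, rather than running full NFKC over the page text, leaves other compatibility characters (for example superscripts or full-width forms) untouched.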
Fixed in r60599.