Last modified: 2013-10-29 05:09:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T31005, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 29005 - WebFonts converts some unicode sequences to older deprecated forms
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: WebFonts
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Santhosh Thottingal
Keywords: i18n
Depends on:
Blocks: 56295
Reported: 2011-05-16 04:44 UTC by praveenp
Modified: 2013-10-29 05:09 UTC (History)

Description praveenp 2011-05-16 04:44:03 UTC
WebFonts unnecessarily converts some Malayalam characters to their older Unicode representations (http://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters). The old code point sequences are not supported by many programs, including Apple's Safari and Google Chrome (Chromium), so WebFonts causes a real problem for readers. It also breaks the user's ability to interlink articles by copy-pasting the title.
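For readers unfamiliar with the two representations, here is a minimal Python sketch (an editorial illustration, not code from WebFonts) that lists the code points of one chillu, chillu N, in the atomic Unicode 5.1 form and in the older consonant + virama + ZWJ sequence. The constant and function names are hypothetical.

```python
# Atomic chillu N, added in Unicode 5.1.
ATOMIC_CHILLU_N = "\u0d7b"
# Older representation: NA + VIRAMA + ZERO WIDTH JOINER.
SEQUENCE_CHILLU_N = "\u0d28\u0d4d\u200d"


def code_points(text):
    """Return the code points of text as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, text in (("atomic", ATOMIC_CHILLU_N), ("sequence", SEQUENCE_CHILLU_N)):
    print(label, code_points(text))

# The two strings differ in length and content, so a plain comparison fails.
print(ATOMIC_CHILLU_N == SEQUENCE_CHILLU_N)
```

Any software that compares, searches, or links text byte for byte treats the two forms as different strings, which is the root of the problems discussed in this report.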
Comment 1 Santhosh Thottingal 2011-05-18 13:52:42 UTC
The problem with Chromium is their bug. We have already reported it:
http://code.google.com/p/chromium/issues/detail?id=45840
If Apple Safari does not show Malayalam properly, it is their bug. Consider filing a bug (if possible).

From http://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters
"Because older data will use different representation for chillus, implementations must be prepared to handle both kinds of data."

If Chromium or Apple Safari breaks this standard, please report bugs against them. Chrome is known to be buggy with Malayalam, and all the Malayalam news portals and blogs are aware of this issue. From the above bug, you can see that it is not limited to Malayalam: Chrome cannot render "Sri Lanka" written in Sinhala because of that bug.

The normalization rules in WebFonts try to use the least common denominator for the encoding, so that users on both older and newer versions of operating systems can read and use the content (for example, copy it to a local machine).

Since whatever a user writes or edits on the Malayalam wiki is force-converted to atomic chillu characters, there is no problem with interlinking articles by copy-pasting. The normalization code automatically converts the title to the correct link. If this is not the case, please show me an example.
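The server-side behaviour described above can be sketched as follows. This is an illustrative Python model, not MediaWiki's actual code; `to_atomic` and `same_title` are hypothetical names, and the mapping covers only the six chillus.

```python
# Older sequence (consonant + virama + ZWJ) -> atomic chillu (Unicode 5.1).
VIRAMA_ZWJ = "\u0d4d\u200d"
SEQUENCE_TO_ATOMIC = {
    "\u0d23" + VIRAMA_ZWJ: "\u0d7a",  # chillu NN
    "\u0d28" + VIRAMA_ZWJ: "\u0d7b",  # chillu N
    "\u0d30" + VIRAMA_ZWJ: "\u0d7c",  # chillu RR
    "\u0d32" + VIRAMA_ZWJ: "\u0d7d",  # chillu L
    "\u0d33" + VIRAMA_ZWJ: "\u0d7e",  # chillu LL
    "\u0d15" + VIRAMA_ZWJ: "\u0d7f",  # chillu K
}


def to_atomic(text):
    """Replace every older chillu sequence with its atomic code point."""
    for sequence, atomic in SEQUENCE_TO_ATOMIC.items():
        text = text.replace(sequence, atomic)
    return text


def same_title(a, b):
    """Compare two titles after normalizing both to atomic chillus."""
    return to_atomic(a) == to_atomic(b)


stored = "\u0d05\u0d35\u0d7b"              # title stored with atomic chillu N
pasted = "\u0d05\u0d35\u0d28\u0d4d\u200d"  # same title pasted in the older form

print(stored == pasted)            # raw comparison of the two strings
print(same_title(stored, pasted))  # comparison after normalization
```

The sketch only holds on a wiki where this normalization actually runs on saved text and link targets; without it, the raw comparison is all the software has.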

Unless WebFonts breaks some MediaWiki functionality, the normalization code is going to stay.
Comment 2 Santhosh Thottingal 2011-05-18 14:11:57 UTC
As an additional choice, I am going to add the Anjali Old Lipi font as the default font for Malayalam. It has dual encoding implemented. To avoid any issues with font preference, I am going to list the fonts in alphabetical order: Anjali, Meera, Rachana, Raghu Malayalam.
Comment 3 praveenp 2011-05-20 05:09:21 UTC
Chillu characters from Unicode 5.1 are not affected by those Chromium and Safari bugs. The current script converts characters that are not affected by the bug into ones that are. I think Malayalam Wikipedia is not using WebFonts yet; besides Malayalam Wikipedia, there may be many other MediaWiki installations. I wonder how this can help interlinking by copy-pasting titles there. You may use these fonts created by Junaid P V (https://github.com/junaidpv/Malayalam-Fonts/archives/master) with the new chillu characters; the set contains all the fonts listed here.

I am not sure how WebFonts works on mobile devices, but none of those from Apple, including iOS devices, support the old encoding. This is important because traffic from them is increasing day by day.

Mediawiki itself has problem with joiner based characters.
Comment 4 Bawolff (Brian Wolff) 2011-05-20 05:19:01 UTC
>Mediawiki itself has problem with joiner based characters.

Really, like what? (I'm just curious). Is there a bug about it?
Comment 5 praveenp 2011-05-20 17:07:46 UTC
(In reply to comment #4)
> >Mediawiki itself has problem with joiner based characters.
> 
> Really, like what? (I'm just curious). Is there a bug about it

Before the MediaWiki 1.16 deployment on Wikimedia wikis, search errors were common because MediaWiki ignored the joiner when searching and treated both characters (with and without the joiner) as the same. After the 1.16 deployment, chillu characters in the database were switched to the 5.1 version, so I am not sure whether the problem persists in MediaWiki now ;-) But I haven't heard that anyone fixed it.
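The search problem described here can be reproduced in a few lines. The sketch below is an editorial illustration in Python, assuming a search layer that simply drops joiners before comparing; `strip_joiners` is a hypothetical stand-in, not MediaWiki's search code.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def strip_joiners(text):
    """Model a search layer that ignores joiner characters."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


chillu_n_sequence = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZWJ: chillu N, older form
na_with_virama = "\u0d28\u0d4d"           # NA + VIRAMA: a different written form
atomic_chillu_n = "\u0d7b"                # atomic chillu N (Unicode 5.1)

# Once the joiner is dropped, the two distinct forms collapse into one key.
print(strip_joiners(chillu_n_sequence) == na_with_virama)

# The atomic chillu contains no joiner, so stripping leaves it untouched.
print(strip_joiners(atomic_chillu_n) == atomic_chillu_n)
```

This is why storing atomic chillus sidesteps the issue: there is no joiner left for the search layer to discard.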

Please see an old screenshot here http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/46413/

all results in the screenshot are wrongly displayed.
Comment 6 Santhosh Thottingal 2011-05-22 16:13:58 UTC
(In reply to comment #3)
> Chillu characters from unicode 5.1 is not affected by those chromium and safari
> bugs. Current script converts characters not affected by the bug to those
> affected by the bug.

Let us not mix the chromium bug and an optional feature of webfonts.

> I think Malayalm Wikipedia not using WebFonts yet, besides
> Malayalam Wikipedia, there may be many other implementations to Mediawiki. I
> wonder how this can help interlinking by copy-pasting the titles there. 

WebFonts is an _extension_. For MediaWiki instances outside Wikipedia, one can decide whether it should be installed and enabled. If it is installed, a user can disable it completely in the user preferences, or temporarily using the menu. The extension is configurable, and one can completely remove the normalization rules or use one's own fonts. It is well documented (see http://www.mediawiki.org/wiki/Extension:WebFonts). So whether the rules are necessary for an instance is up to the admin of the wiki.

> You may
> use these fonts created by Junaid P V
> (https://github.com/junaidpv/Malayalam-Fonts/archives/master) with new chillu
> character, which contains all fonts listed here.

We cannot and should not use unofficial fonts from a random location. Those fonts are neither a fork nor maintained by typography experts. If somebody reports a bug in a font, I need to contact the typographers, and for that I should use official, upstream fonts. The Anjali Old Lipi font has dual encoding implemented; the upstream version from the Varamozhi project is already added to WebFonts, and it is the default font.

> I am not sure about working of WebFonts from mobile devices but none of those 
> from Apple, including iOS not supporting old encoding. It is important because
> day by day traffic through them are increasing.

That is a separate topic altogether. If mobile devices have broken rendering, it is a bug in them. Sri Lanka is not going to change its country name if mobile phones do not render it properly when written in Sinhala, or because traffic is low for Sinhala Wikipedia. Rather, mobile phones will fix the bug.

Also note that there is no such thing as an old encoding. "Old encoding" means old data, and there is no concept of old data and new data; it is just data. Dual encoding of Malayalam is a very complicated and serious issue and cannot be solved by WebFonts. We discussed this during the language committee meeting and we are trying to find a solution for it. And please don't bring that chillu discussion here, we have had enough of it :)
Comment 7 praveenp 2011-05-24 14:09:05 UTC
(In reply to comment #6)
I am sorry, but none of those reasons is good enough for converting readable "data" to non-readable "data". Junaid P.V. is not a random person; as you know, he contributes his time to Malayalam computing and is the developer of the Narayam extension, which adds input methods for different languages to the various text input fields in MediaWiki. AGF :)
Comment 8 Mark A. Hershberger 2011-05-24 16:50:39 UTC
(In reply to comment #6)
> (In reply to comment #3)
> > Chillu characters from unicode 5.1 is not affected by those chromium and safari
> > bugs. Current script converts characters not affected by the bug to those
> > affected by the bug.
> 
> Let us not mix the chromium bug and an optional feature of webfonts.

I'm confused by this and then praveenp's response:

(In reply to comment #7)
> I am sorry but none of those reasons are not good enough for converting
> readable "data" to non-readable "data".

If webfonts is converting "Chillu characters from Unicode 5.1", that would be a bug, right?

Or am I reading it wrong?
Comment 9 praveenp 2011-05-25 03:50:03 UTC
(In reply to comment #8)
> If webfonts is converting "Chillu characters from Unicode 5.1", that would be a
> bug, right?

I think he created it as a feature, but ultimately, to an end user, it is a bug. It could be an option for users (I don't know for what), but serving converted chillu characters by default is a usability failure.
Comment 10 Gerard Meijssen 2011-05-25 05:50:33 UTC
The presentation is determined by how the characters are compiled. Many people prefer to read the language in a particular way. This is not necessarily in the latest way Unicode has it. 

By allowing for these differences in presentation, we reach a larger public while retaining the latest Unicode version at the backend.
Comment 11 Bawolff (Brian Wolff) 2011-05-26 05:54:19 UTC
>Many people prefer to read the language in a particular way

Please please please no one make that into a preference. We're talking about unicode code points, not background colours.

The two ways of encoding the character *should* be identical to the user, they aren't - mostly due to crappy software support, but they should be [in the ideal world, well in the ideal world there wouldn't be two ways to encode a single character...].

While it's a little weird for MediaWiki on one end to force-convert everything to the 5.1 encoding, spit it out, then on the JS side run a regex through the entire page converting it back to the older encoding, it doesn't seem horrible if it makes it work for everyone.

Anyways, going back to the original bug:

*For the issue of Safari being stupid and stripping ZWJs: since we're doing this on the client side anyway, might I suggest that WebFonts detects which browser is in use, and only normalizes like that if it's a browser that isn't broken in that way (or disables those fonts in the choice menu if they require such normalizations).

*Per comment 1, I'm also unclear how this could break interlinking, since they're all normalized to one form on the mediawiki side (While I guess it could if the content language is not ml, since we only do the normalization for ml, but that seems like an edge case)
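The client-side step and the browser check suggested in this comment can be sketched as below. This is an editorial illustration written in Python rather than the extension's JavaScript; `to_sequence`, `display_text`, and both boolean parameters are hypothetical, and how a real client would detect a ZWJ-stripping browser is left open.

```python
import re

# Atomic chillu (Unicode 5.1) -> older consonant + virama + ZWJ sequence.
ATOMIC_TO_SEQUENCE = {
    "\u0d7a": "\u0d23\u0d4d\u200d",  # chillu NN
    "\u0d7b": "\u0d28\u0d4d\u200d",  # chillu N
    "\u0d7c": "\u0d30\u0d4d\u200d",  # chillu RR
    "\u0d7d": "\u0d32\u0d4d\u200d",  # chillu L
    "\u0d7e": "\u0d33\u0d4d\u200d",  # chillu LL
    "\u0d7f": "\u0d15\u0d4d\u200d",  # chillu K
}
ATOMIC_CHILLU = re.compile("[\u0d7a-\u0d7f]")


def to_sequence(text):
    """Rewrite atomic chillus as the older sequences, as a font rule might."""
    return ATOMIC_CHILLU.sub(lambda m: ATOMIC_TO_SEQUENCE[m.group(0)], text)


def display_text(text, font_needs_sequences, browser_keeps_zwj):
    """Convert only when the font needs it AND the browser preserves ZWJ."""
    if font_needs_sequences and browser_keeps_zwj:
        return to_sequence(text)
    return text


word = "\u0d05\u0d35\u0d7b"
print(display_text(word, True, True) == word)   # converted for display
print(display_text(word, True, False) == word)  # left alone on a broken browser
```

The alternative raised in the comment, hiding fonts that require the conversion on affected browsers, would move the same check into the font menu instead of the text pipeline.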
Comment 12 Santhosh Thottingal 2011-05-26 06:56:43 UTC
The rendering issue with Google Chrome for ZWJ/ZWNJ got fixed in Chrome 12, i.e. http://code.google.com/p/chromium/issues/detail?id=45840 is fixed now.

Google Chrome also did not support webfonts with complex scripts (http://code.google.com/p/chromium/issues/detail?id=78155). This got fixed in Chrome 12 as well.
Comment 13 praveenp 2011-05-26 07:42:17 UTC
(In reply to comment #11)
> *Per comment 1, I'm also unclear how this could break interlinking, since
> they're all normalized to one form on the mediawiki side (While I guess it
> could if the content language is not ml, since we only do the normalization for
> ml, but that seems like an edge case)

:) http://wiki.smc.org.in is a Malayalam site which uses the English interface.

And people are still terribly addicted to English while using Internet.

Is it possible for a site admin to set $wgFixMalayalamUnicode to false in DefaultSettings.php to keep users' contributions untouched? Popular Windows and Mac tools for typing Malayalam use the 5.1 encoding, so copy-pasting a title for linking will surely fail. I know the default font, Anjali Old Lipi, shows text exactly as it is in the database. But it is somewhat buggy (lots of spelling mistakes), and people will eventually switch to one of the other fonts offered, for better display.
Comment 14 Santhosh Thottingal 2011-05-26 07:57:04 UTC
(In reply to comment #11)
> >Many people prefer to read the language in a particular way
> 
> Please please please no one make that into a preference. We're talking about
> unicode code points, not background colours.
> 
> The two ways of encoding the character *should* be identical to the user, they
> aren't - mostly due to crappy software support, but they should be [in the
> ideal world, well in the ideal world there wouldn't be two ways to encode a
> single character...].

In an ideal world, we expect that. But in the real world, dual encoding is there, at least for Malayalam. To make things simple:

A letter L is written in the L1 way. In 2009, Unicode said it can be written in the L2 way too, and asked applications to support both. Obviously, many applications failed to do this. Unicode did not define L1 and L2 as equal, so there are big issues with search, sort, and what not. ml Wikipedia decided to keep its data in L2 using a forced conversion. Many websites decided to stick with L1, for stability and backward compatibility, until there is a Unicode definition stating L1 == L2, because that is the minimum version they (not only websites, but operating systems and applications too) can support. At the same time, L1 is well supported in the majority of applications (Google Chrome used to support it; it was broken from Chrome 6.0 to Chrome 11 and is now fixed). There are fonts which do not show the same glyph for L1 and L2, because their typographers care about the language and are aware of the dual encoding issues. So, to make everybody happy, and just for these extreme cases, I added a feature to do the L2->L1 conversion, so that users can view and use L1, which has been working in these systems for many years. It is not meant for all languages or for all fonts, and it is a configuration entry.

L1 vs L2 is a very controversial issue. And it becomes more complex when I say that there is more than one L with this issue.

> *Per comment 1, I'm also unclear how this could break interlinking, since
> they're all normalized to one form on the mediawiki side (While I guess it
> could if the content language is not ml, since we only do the normalization for
> ml, but that seems like an edge case)

You are correct. It does not break any interlinking.

Since there is no reproducible case showing how this option breaks anything, and I have explained as best I can why it was added, shall we close it?

Finding out the broken versions of the browser (in our case Chrome 6 to 11) and changing the extension's behavior based on that... do we really need to do that? Considering that Chrome did not support webfonts (complex-script webfonts; Malayalam is an example) at all until Chrome 11, it seems unnecessary. Let us just declare that "Chrome was broken, and was not supporting Malayalam rendering or Malayalam webfonts till version 12". I hope that helps. The proof is Chrome bugs 45840 and 78155.
Comment 15 praveenp 2011-05-26 09:32:07 UTC
(In reply to comment #14)

Please do not mix other issues with the current problem. Even if all the other bugs, including those in Safari and mobile devices, get fixed like the Chromium one, why would we want to convert the encoding in the database to the old encoding without the reader's direct request?

Why can't this conversion be an option rather than the default?

Sticking with Unicode 5.0 fonts is not a good idea for Malayalam. Unicode corrected errors like the representation of "zero", and new code points will be included for new symbols (e.g. the Rupee symbol in Unicode 6.0).

The implementation of the default Anjali Old Lipi is buggy.

So reopening.
Comment 16 praveenp 2011-06-14 01:38:25 UTC
Chromium bug is still open! - Chromium 12.0.742.91 (87961) Ubuntu 11.04.
Comment 17 Niklas Laxström 2011-08-31 12:44:40 UTC
(In reply to comment #15)
why we
> really want  convert encoding in database to old version encoding without
> reader's direct request.

This is not what happens. WebFonts only converts what is displayed (and even that happens only for some particular fonts). MediaWiki itself normalizes all data to a specific format (which is the newest format in Unicode, as far as I know).
Comment 18 Junaid 2011-08-31 13:53:06 UTC
(In reply to comment #17)
> This is not what happens. WebFonts only converts what is displayed (and even
> that happens only for some particular fonts). MediaWiki itself normalizes all
> data to specific format (which is ewest format in Unicode, as far as I know).

But WebFonts converts data in text fields too. That will cause problems on wikis that do not have normalisation enabled, for example Wikimedia Commons. If we open and save pages on such wikis, data will be converted unintentionally. I think this is a critical bug.
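To make the risk concrete, here is a small Python model of a page with one rendered text node and one edit field. It is an editorial sketch, not the extension's code; `Node`, `convert_for_display`, and the `skip_editable` flag are all hypothetical.

```python
from dataclasses import dataclass

# Atomic chillu (Unicode 5.1) -> older consonant + virama + ZWJ sequence.
TO_SEQUENCE = str.maketrans({
    "\u0d7a": "\u0d23\u0d4d\u200d",  # chillu NN
    "\u0d7b": "\u0d28\u0d4d\u200d",  # chillu N
    "\u0d7c": "\u0d30\u0d4d\u200d",  # chillu RR
    "\u0d7d": "\u0d32\u0d4d\u200d",  # chillu L
    "\u0d7e": "\u0d33\u0d4d\u200d",  # chillu LL
    "\u0d7f": "\u0d15\u0d4d\u200d",  # chillu K
})


@dataclass
class Node:
    kind: str   # "text" for rendered content, "textarea" for an edit field
    value: str


def convert_for_display(nodes, skip_editable):
    """Apply the display conversion, optionally leaving edit fields alone."""
    converted = []
    for node in nodes:
        if skip_editable and node.kind == "textarea":
            converted.append(node)
        else:
            converted.append(Node(node.kind, node.value.translate(TO_SEQUENCE)))
    return converted


page = [Node("text", "\u0d05\u0d35\u0d7b"), Node("textarea", "\u0d05\u0d35\u0d7b")]
unguarded = convert_for_display(page, skip_editable=False)
guarded = convert_for_display(page, skip_editable=True)

# Without the guard, the edit field no longer holds what the server sent,
# so saving it on a wiki without normalization would store changed text.
print(unguarded[1].value == page[1].value)
print(guarded[1].value == page[1].value)
```

Restricting the conversion to non-editable content is one possible mitigation; the thread itself ends up resolving the issue differently, by removing the fonts that needed the conversion.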
Comment 19 Santhosh Thottingal 2011-08-31 14:19:02 UTC
(In reply to comment #18)
> But, WebFonts convert data in text fields too. That will make problems on
> wikies that do not have normalisation enabled, for example Wikimedia Commons.
> If we open and save pages in such wikies, data will be converted
> unintentionally. I think this is a critical bug.

Normalization is enabled in wikis as a fix for a reported bug. Without it, Firefox and Chrome extensions like fix-ml, people using Inscript keyboards, and keyboards other than Narayam will surely enter chillu, AU vowel sign and NTA in the Unicode 5.0 way. This was considered a bug, and that's why normalization is enabled in other wikis. If it is not enabled on Commons, please file a bug for that.

Please understand that dual encoding is an issue, and the Malayalam normalization in the wiki is a workaround and not a solution. The solution should come from the UTC, and I am trying for that. Can we just hold on until we get a reply from TDIL or the UTC? Alternatively, I can keep only the AnjaliOldLipi font for Malayalam. If I get confirmation from 2-3 people from the Malayalam wiki, I will remove the Meera, Rachana and RaghuMalayalam fonts, thereby avoiding the normalization rules of those fonts. Let me know.
Comment 20 Bawolff (Brian Wolff) 2011-08-31 14:52:27 UTC
>If it is not enabled in commons, please file a bug for that. 

Currently, MediaWiki's chillu normalizations (which I believe is what comment 18 is referring to) are only enabled on wikis with a content language of ml (see the docs on $wgFixMalayalamUnicode). It would probably make sense to have those normalizations on multilingual wikis as well (for that matter, it's weird that there are different normalizations in use for different language wikis, but I guess there are performance concerns), but anyway, that is a separate bug.
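The effect of tying the normalization to the content language can be modelled as follows. This Python sketch only mirrors the behaviour described in this comment; `normalize_title` and its parameters are hypothetical and are not MediaWiki's implementation of $wgFixMalayalamUnicode.

```python
# Older sequence (consonant + virama + ZWJ) -> atomic chillu (Unicode 5.1).
SEQUENCE_TO_ATOMIC = {
    "\u0d23\u0d4d\u200d": "\u0d7a",  # chillu NN
    "\u0d28\u0d4d\u200d": "\u0d7b",  # chillu N
    "\u0d30\u0d4d\u200d": "\u0d7c",  # chillu RR
    "\u0d32\u0d4d\u200d": "\u0d7d",  # chillu L
    "\u0d33\u0d4d\u200d": "\u0d7e",  # chillu LL
    "\u0d15\u0d4d\u200d": "\u0d7f",  # chillu K
}


def normalize_title(title, content_language, fix_malayalam_unicode=True):
    """Normalize to atomic chillus only when the wiki's content language is ml."""
    if fix_malayalam_unicode and content_language == "ml":
        for sequence, atomic in SEQUENCE_TO_ATOMIC.items():
            title = title.replace(sequence, atomic)
    return title


atomic_title = "\u0d05\u0d35\u0d7b"
sequence_title = "\u0d05\u0d35\u0d28\u0d4d\u200d"

# On an ml wiki both spellings resolve to the same page title.
print(normalize_title(sequence_title, "ml") == normalize_title(atomic_title, "ml"))

# On a wiki with another content language they remain two different titles.
print(normalize_title(sequence_title, "en") == normalize_title(atomic_title, "en"))
```

This is the edge case raised in comments 11 and 13: a Malayalam-content site running with a non-ml content language gets no normalization, so links built from text in the other form do not resolve.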
Comment 21 Junaid 2011-09-01 06:24:14 UTC
What about removing normalisation within this extension and using hacked fonts that can show all characters for Malayalam?
Comment 22 Niklas Laxström 2011-09-01 07:22:42 UTC
(In reply to comment #21)
> What about removing normalisation within this extension and using hacked fonts
> that can show all characters for Malayalam?

Is there no font that can show all characters without hacking?
Comment 23 Junaid 2011-09-01 08:23:19 UTC
(In reply to comment #22)
 
> Is there no font that can show all characters without hacking?

Only one, AnjaliOldLipi, among the popular fonts used by WebFonts. That is what the second paragraph of comment #19 refers to.
Comment 24 praveenp 2011-10-26 05:42:58 UTC
I wonder why this is prioritized low, even though it affects users directly!
Comment 25 vssun 2011-10-26 06:38:29 UTC
(In reply to comment #23)
> (In reply to comment #22)
> 
> > Is there no font that can show all characters without hacking?
> 
> Only one, AnjaliOldLipi, among popular fonts and used by WebFonts. It is what
> second para of comment #19 referring.

Now aruna also shows well. http://sourceforge.net/projects/aruna/
Comment 26 Cibu C J 2011-12-16 22:41:51 UTC
Unicode didn't add the Malayalam chillu characters on a whim. They were added after around two years of deliberation. The UTC finally concluded that the practice that existed before 5.1 was problematic and that standalone characters had to be defined for the Malayalam chillus.

It is a misreading of the standard to say that it specifies two different encodings for chillus. There is only one encoding, and that is the standard chillus defined in 5.1. What the standard says is that rendering implementations should be prepared to handle the pre-existing data that was present before chillus were properly defined. So, if you are converting code points at all, it should be from the pre-existing sequences to the standard chillus.

Also, keep in mind that these two forms (the standard chillus and their pre-existing sequence counterparts) will never be canonically equivalent. Characters have to be marked canonically equivalent when they are defined. That didn't happen, so by the rules it never will.
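The lack of equivalence can be checked with any Unicode normalization library. The sketch below is an editorial illustration using Python's standard `unicodedata` module; `equal_under` is a hypothetical helper name.

```python
import unicodedata

ATOMIC_CHILLU_N = "\u0d7b"                # standard chillu N (Unicode 5.1)
SEQUENCE_CHILLU_N = "\u0d28\u0d4d\u200d"  # pre-existing NA + VIRAMA + ZWJ


def equal_under(form):
    """Report whether the two spellings become equal under a normalization form."""
    atomic = unicodedata.normalize(form, ATOMIC_CHILLU_N)
    sequence = unicodedata.normalize(form, SEQUENCE_CHILLU_N)
    return atomic == sequence


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equal_under(form))
```

Because the atomic chillus carry no decomposition mapping, none of the four standard normalization forms unifies the two spellings, so any folding between them has to be done by application-specific rules like the ones discussed in this report.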

We don't need to play UTC here. Rather, we should be thinking about what is best for Malayalam users. If you take stock of things today from the implementation point of view, it looks like this:

Standard chillus(>=5.1):
- All rendering systems support them because they are plain simple characters without any special joining properties. If the font has it, rendering engine can display it.
- Almost all Malayalam fonts support it. In case of fonts like Rachana, Meera etc, even though original version does not have the chillu characters, there are versions available with the standard chillus.

Pre-existing non-standard chillus(<5.1):
- In the case of rendering systems it is hit or miss. Some browsers on some systems can display them correctly, for example Firefox + Linux or Chrome + Windows. Others cannot display them, for example Chrome + Linux.
- All Malayalam fonts support them.

Since this is about WebFonts, fonts are in Wikimedia's control, but the rendering systems are not. So you should be going with the option that would fetch maximum support from rendering systems.

Also, I want to mention the original political positions of Santhosh and me. Santhosh was arguing against standalone chillus and I was arguing for it. However, decision has been made by UTC years back. Now it is time for implementations to follow the standard so that a standard will be beneficial to its users. Wikimedia should not get stuck in Unicode 5.0 and it should progress to later versions as the Unicode standard progresses.
Comment 27 Shiju Alex 2011-12-17 02:47:15 UTC
I found that Santhosh Thottingal is trying to bring the dual encoding issue into the Wikimedia world using WebFonts as a platform. This is not at all acceptable to the Malayalam Wikimedia community. Dual encoding is not an issue inside wiki projects.


This issue needs to be fixed immediately, considering its severity. I have changed the priority of the issue.

Also some third person developer need to handle all the issues related to malayalam. Santhosh is using his official role in WMF to play around with Malayalam data to push his personal POV (and his Free Software organization's POV). This is not acceptable to Malayalam wikimedia community.
Comment 28 Santhosh Thottingal 2011-12-17 09:06:50 UTC
(In reply to comment #26)
> Also, I want to mention the original political positions of Santhosh and me.
> Santhosh was arguing against standalone chillus and I was arguing for it.
> However, decision has been made by UTC years back. Now it is time for
> implementations to follow the standard so that a standard will be beneficial to
> its users. Wikimedia should not get stuck in Unicode 5.0 and it should progress
> to later versions as the Unicode standard progresses.

I don't have any disagreement with this. And I am not for keeping MediaWiki or any software on older Unicode versions. Yes, I disagreed with the UTC's decision, but that is irrelevant now. I want to support the new version of Unicode everywhere. I have asked the designers of the fonts to update to new versions. They were not ready, and they disagreed with the UTC's decision. Recently they told me that they are not for sticking with 5.0 and want to move forward. New versions of the fonts will be released, not only with the characters in question but also supporting the new characters in newer versions of Unicode. I don't think the UTC will take any decision on equivalence. Until then, I wanted to keep the two fonts Meera and RaghuMalayalam as non-default fonts. But to add them, I have to use the character conversion. I asked Shiju many times whether I could remove them, but I did not get a clear answer. I am going to remove them now and will add them back when new versions of those fonts are ready.
Comment 29 Santhosh Thottingal 2011-12-17 09:21:36 UTC
Meera and Raghu Malayalam were removed from the options for Malayalam in r106502.
They will be reintroduced when the upstreams release new versions with the latest Unicode support.
Now Malayalam has only AnjaliOldLipi as an option. The Malayalam community (that includes me) can file a new bug if any other fonts need to be added (they should be open source and well maintained, with an active upstream).

Please confirm and close the bug. Thanks
Comment 30 Shiju Alex 2011-12-19 06:07:06 UTC
Thanks for fixing this issue. I suggest that someone technically competent verify and close this bug.

I am really sorry for the statement //Also some third person developer need to handle all the issues related to Malayalam//

I withdraw that statement and apologize for it. As long as there is no forced conversion of existing Unicode text to an old Unicode version just for displaying the text in a Unicode 5.0 font, I do not have any problem with Santhosh working on any issue related to Malayalam. Sorry once again for that statement.
Comment 31 Cibu C J 2011-12-19 19:26:14 UTC
(In reply to comment #28)
> (In reply to comment #26)
> forward. New versions of the fonts will be released.. Not only with the
> characters in question, but also supporting new characters in version of
> Unicode. 

That is great news! Thanks, Santhosh. Along with that, I would love to see users given an equal opportunity to choose between a modern and a traditional orthography font. From the 1970s onward, kids have been studying the new orthography. Whether we like it or not, that is a fact, and MediaWiki or any software should honor it. However, I don't have a font to suggest. It is just something to keep in mind for future font selections.

> I don't think UTC will take any decision on equivalence. Till then I
> wanted to keep the two fonts Meera and Raghumalayalam as non default fonts.

There is no 'till then..'. As I mentioned before, that is not going to happen, and no development plans should wait for anything like that.

> But
> to add them, I have to use the character conversion. I asked Shiju many tiimes
> whether I can remove them. But I did not get clear answer. But I am going to
> remove them now and will add when a new version of those fonts are ready.

What about using the forks of those fonts with standard chillus? I know that when the original fonts add the additional characters defined in later Unicode versions, those characters will not get propagated to the forks. However, those characters are really archaic, and chillus are very common. So chillu support should trump support for the new archaic characters.
Comment 32 Mark A. Hershberger 2012-01-04 00:47:33 UTC
Setting normal priority since it seems like all the urgent issues here are taken care of.  Leaving this to Santhosh or someone else to close.
Comment 33 Santhosh Thottingal 2012-03-14 12:44:28 UTC
Meera font  updated with latest version from upstream in r113808
