Image uploads from Mac OS X to a Wikipedia using UTF-8 result in the image not being found later. This appears to be independent of the browser used. (I'm not experiencing this bug myself, as I don't have a Mac; I'm just reporting something that has been discussed in the German Wikipedia: <http://de.wikipedia.org/w/wiki.phtml?title=Wikipedia_Diskussion:UTF8-Probleme#Umlaute_in_Upload_Dateinamen_bei_Mac_OS_X> (German).)

The reason for this problem seems to be that the Mac OS file system uses a different decomposition policy for file names than other operating systems or most browsers. To me it seems that the best solution (and The Right Thing) would be to perform Unicode canonicalisation (see <http://www.unicode.org/notes/tn5/>) on the server side, on the names of uploaded files, but also on search terms and article titles.

To clarify: in Unicode (and therefore in UTF-8) there are often several ways of expressing the same character. For instance, there is a separate character for "ü", but also a way to express it as "u" + combining dots. The two representations are (or should be) equivalent, but are not handled as such by the wiki software. It would be best to enforce a consistent internal canonical form by processing all incoming Unicode.

The following appeared on the mailing list unicode@unicode.org:

  FYI, by far the largest source of text in NFD (decomposed) form in Mac OS X is the file system. File names are stored this way (for historical reasons), so anything copied from a file name is in (a slightly altered form of) NFD. Also, a few keyboard layouts generate text that is partly decomposed, for ease of typing (e.g., Vietnamese).

  Deborah Goldsmith
  Internationalization, Unicode liaison
  Apple Computer, Inc.
  goldsmit@apple.com

This makes it quite clear that this is not a bug on the part of Mac OS: it's a classic incompatibility, which should be handled by the server.
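To make the two representations concrete, here is a minimal PHP sketch (the Normalizer class is from PHP's intl extension and is used here only as a stand-in for whatever normalization routine the wiki would adopt):

    <?php
    // Precomposed form (NFC): U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    $precomposed = "\xC3\xBC";      // "ü" as a single code point
    // Decomposed form (NFD): U+0075 "u" followed by U+0308 COMBINING DIAERESIS
    $decomposed  = "u\xCC\x88";     // what Mac OS X file names contain

    var_dump( $precomposed === $decomposed ); // false: the byte strings differ

    // After normalizing to the composed form, the two compare equal:
    var_dump( Normalizer::normalize( $decomposed, Normalizer::FORM_C ) === $precomposed ); // true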
The same thing happens on the French wiki at http://fr.wikipedia.org/wiki/Image:Lieutenant-colonel_des_armes_a%CC%80_cheval.png Note the "a%CC%80" (a decomposed "à").
I have dug up some more info on this: the crucial point is that *some* canonical form (normalization form) should be used as the internal representation. For compatibility reasons, this should probably be a composed form, as the decomposed forms are rendered badly on some systems. Here is the official document on Unicode normalization forms: <http://www.unicode.org/reports/tr15/> HTH
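A small sketch of what picking the composed form (NFC) as the internal representation would mean in practice; this also covers the "a%CC%80" case from the French wiki above (Normalizer is again from PHP's intl extension, used only for illustration):

    <?php
    // Sketch: convert all incoming text to NFC as the canonical internal form.
    function toInternalForm( $utf8Text ) {
        return Normalizer::normalize( $utf8Text, Normalizer::FORM_C );
    }

    // "à" in NFD ("a" + U+0300 COMBINING GRAVE, as a Mac file name delivers it)
    // becomes the single precomposed code point U+00E0:
    $nfd = "a\xCC\x80";
    var_dump( bin2hex( toInternalForm( $nfd ) ) ); // string "c3a0"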
This has to be done at least for:
* User names
* File names
* Page titles
and should also be done for wikitext, at least in the search index.
Bug seems to be specific to Safari. Firefox 0.9.1 and IE 5.2.3 both normalize the name to the precomposed form.

Uploaded from Safari: http://meta.wikimedia.org/wiki/Image:Wiki_test_e%CC%81.png
Uploaded from Firefox and IE: http://meta.wikimedia.org/wiki/Image:Wiki_test_%C3%A9.png

Nonetheless we certainly should be normalizing input... Check if there's an iconv or mb_* function for doing this efficiently.
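To see the difference in the two URLs above, a quick illustrative check in PHP:

    <?php
    // The same visible title, as encoded by the two browsers:
    $fromSafari  = rawurldecode( 'Wiki_test_e%CC%81.png' ); // "e" + U+0301 COMBINING ACUTE (NFD)
    $fromFirefox = rawurldecode( 'Wiki_test_%C3%A9.png' );  // precomposed U+00E9 "é" (NFC)

    var_dump( $fromSafari === $fromFirefox ); // false: different bytes, so different page titles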
I spent a few minutes googling and came up with nothing useful pre-existing in PHP. Guess I'll have to write another hack. :P All the necessary data should be in the Unicode data tables... It may be possible to write a DSO extension that makes use of existing library functions (libidn seems to have UTF-8-based normalization functions for instance) but we'll need a 'native' PHP version anyway for general distribution.
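A rough sketch of the shape such a thing might take. Nothing here is existing code: the function and class names are made up, and the normalizer_normalize() fast path is PHP's later intl extension, standing in for a hypothetical DSO wrapping libidn:

    <?php
    // Hypothetical wrapper: use a compiled extension when one is available,
    // otherwise fall back to a pure-PHP, table-driven implementation.
    function wfNormalizeToNFC( $utf8String ) {
        if ( function_exists( 'normalizer_normalize' ) ) {
            // Fast path via a native extension (here PHP's intl, purely as an example).
            return normalizer_normalize( $utf8String, Normalizer::FORM_C );
        }
        // Slow path: pure-PHP normalization built from the Unicode data tables
        // (decompose, reorder combining marks canonically, recompose).
        return PurePhpNormalizer::toNFC( $utf8String ); // hypothetical class
    }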
libidn provides a stringprep_utf8_nfkc_normalize() function. The glyphs created by this normalization differ from the input, e.g. ² becomes 2. When using this for user names, would we want to preserve the original string for display but use an internal representation for comparison? There is a PHP-libidn binding at http://php-idn.bayour.com/ but it looks like they do not yet provide access to stringprep_utf8_nfkc_normalize().
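A sketch of that display/comparison split, using PHP's intl Normalizer as a stand-in for libidn's stringprep_utf8_nfkc_normalize(); the policy shown is only a suggestion, not what the software currently does:

    <?php
    $name = "User\xC2\xB2";   // "User²" (U+00B2 SUPERSCRIPT TWO)

    $nfc  = Normalizer::normalize( $name, Normalizer::FORM_C );  // still "User²" – canonical only
    $nfkc = Normalizer::normalize( $name, Normalizer::FORM_KC ); // "User2" – compatibility folding

    // Possible policy: keep the NFC form for storage and display, but use the
    // NFKC form as the comparison key, so "User²" and "User2" cannot coexist.
    $displayName   = $nfc;
    $comparisonKey = $nfkc;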
The ucdata library might be interesting; it provides both composition and decomposition, upper case, etc.: http://crl.nmsu.edu/~mleisher/ucdata.html The download page at that site is broken; rev 2.5 is available at ftp://crl.nmsu.edu/CLR/multiling/unicode/ucdata-2.5.tar.gz
A further note: in addition to being decomposed, Safari is actually sending the filename with **HTML character references**: "Wiki test é.png"

Adding an accept-charset attribute to the <form> unfortunately doesn't seem to change anything. Also, in current 1.4 CVS the # is now stripped to - before we get to the point where we normalize the title and would be interpreting the character reference, so things get even weirder.
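If Safari really submits character references, something along these lines would be needed before normalization; the submitted string here is a hypothetical reconstruction, since the exact reference form is not visible in the comment above:

    <?php
    // Hypothetical: the filename arrives as a numeric character reference
    // ("&#769;" is U+0301 COMBINING ACUTE ACCENT) rather than raw UTF-8.
    $submitted = 'Wiki test e&#769;.png';

    // Decode the references first, then normalize; otherwise the "#" gets
    // mangled by the title-cleaning step mentioned above.
    $decoded = html_entity_decode( $submitted, ENT_QUOTES, 'UTF-8' );
    // $decoded is now "e" + U+0301, i.e. the decomposed form, which a
    // subsequent NFC pass would turn into the precomposed "é".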
Now fixed in 1.4 CVS. Might consider backporting the isolated filename normalization part to 1.3 on account of the Safari problem without risking the general case; leaving this bug open for the moment.
1.4 nearing release; not backporting.