Last modified: 2010-05-15 15:29:35 UTC
Image uploads from Mac OS X to a Wikipedia using UTF-8 result in the image not
being found later. This appears to be independent of the browser used. I'm not
experiencing this bug myself, as I don't have a Mac; I'm just reporting
something that has been discussed in the German WP:
The reason for this problem seems to be that the Mac OS file system uses a
different decomposition policy for filenames than is used on other operating
systems or by most browsers. To me it seems that the best solution (and The
Right Thing) would be to perform a Unicode canonicalisation (see
<http://www.unicode.org/notes/tn5/>) on the server side, on names of uploaded
files, but also on search terms and titles of articles.
To clarify: in Unicode (and therefore in UTF-8) there are often several ways of
expressing the same character. For instance, there is a separate character for
"ü", but also a way to express it as "u" followed by a combining "dots" mark.
The two representations are (or should be) equivalent, but are not handled as
such by the wiki software. It
would be best to enforce a consistent internal canonicalisation by processing all
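To illustrate the point about equivalent representations, here is a minimal sketch in Python (not the wiki's PHP code; Python's standard `unicodedata` module is used purely for demonstration). It shows that the two spellings of "ü" compare as different strings until both are brought to the same normal form.

```python
import unicodedata

composed = "\u00fc"       # "ü" as a single precomposed code point
decomposed = "u\u0308"    # "u" followed by COMBINING DIAERESIS

# The two strings render the same but are not equal code point by code point.
print(composed == decomposed)

# After normalising both to NFC (composed form) they are identical.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
```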
The following appeared on the mailing list email@example.com:
FYI, by far the largest source of text in NFD (decomposed) form in Mac
OS X is the file system. File names are stored this way (for historical
reasons), so anything copied from a file name is in (a slightly altered
form of) NFD.
Also, a few keyboard layouts generate text that is partly decomposed,
for ease of typing (e.g., Vietnamese).
Internationalization, Unicode liaison
Apple Computer, Inc.
This makes it quite clear that this is not a BUG on the part of Mac OS - it's a
classical incompatibility, which should be handled by the server.
Same in French wiki at
Note the "a%CC%80"
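For reference, "%CC%80" is the percent-encoded UTF-8 form of U+0300 (combining grave accent), so "a%CC%80" is a decomposed "à". The sketch below, in Python and using only the standard library, decodes that fragment and shows what the composed form would look like in a URL. It is an illustration only, not what the wiki software currently does.

```python
import unicodedata
from urllib.parse import quote, unquote

# Decode the fragment seen in the URL: "a" + U+0300 COMBINING GRAVE ACCENT.
raw = unquote("a%CC%80")

# Compose it into the single code point U+00E0 ("à").
fixed = unicodedata.normalize("NFC", raw)

# Re-encode for a URL; the composed form is two bytes instead of three.
reencoded = quote(fixed)
print(reencoded)
```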
I have dug up some more info on this:
The crucial point is that *some* canonicalisation (normal form) should be used as
the internal representation. For compatibility reasons, this should probably be a
composed form, as the decomposed forms are rendered badly on some systems. Here
is the official document about Unicode normal forms:
This has to be done at least for:
* User names
* File names
* Page titles
and should also be done for wikitext, at least in the search index.
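A minimal sketch of what "use one normal form internally" means for lookups, written in Python for illustration. The function name `canonical_name` and the `images` dictionary are hypothetical stand-ins, not part of the wiki software.

```python
import unicodedata

def canonical_name(name: str) -> str:
    """Return the NFC (composed) form of a user name, file name or page title."""
    return unicodedata.normalize("NFC", name)

# Hypothetical in-memory store standing in for the wiki's image table.
images = {}

# Decomposed name, as it might arrive from a Mac file system.
uploaded = "Cafe\u0301.png"
images[canonical_name(uploaded)] = "file contents"

# Composed name, as it might be typed in a link on another system.
requested = "Caf\u00e9.png"
found = canonical_name(requested) in images
print(found)
```

Because both the stored key and the lookup key pass through the same function, it does not matter which form the client sent.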
Bug seems to be specific to Safari. Firefox 0.9.1 and IE 5.2.3 both normalize the name to the precomposed form.
Uploaded from Safari:
Uploaded from Firefox and IE:
Nonetheless we certainly should be normalizing input... Check if there's an iconv or mb_* function for doing this efficiently.
I spent a few minutes googling and came up with nothing useful pre-existing in PHP. Guess I'll have to write another hack. :P
All the necessary data should be in the Unicode data tables... It may be possible to write a DSO extension that makes use of
existing library functions (libidn seems to have UTF-8-based normalization functions for instance) but we'll need a 'native'
PHP version anyway for general distribution.
libidn provides a stringprep_utf8_nfkc_normalize() function. The glyphs created
by this normalization differ from the input. e.g. ² becomes 2.
When using this for user names, would we want to preserve the original string
for displaying but use an internal representation for comparing?
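One way to picture the "preserve for display, normalise for comparison" idea, sketched in Python. Note that Python's `unicodedata.normalize("NFKC", ...)` is plain NFKC and is only assumed here to behave like libidn's function for this particular character; `comparison_key` is a hypothetical helper name.

```python
import unicodedata

def comparison_key(name: str) -> str:
    """Hypothetical helper: NFKC form used only for comparing names."""
    return unicodedata.normalize("NFKC", name)

display_name = "User\u00b2"          # "User²", kept as entered for display
key = comparison_key(display_name)   # compatibility form, "²" folded to "2"

# NFC, by contrast, leaves the superscript alone.
nfc = unicodedata.normalize("NFC", display_name)

print(display_name, key, nfc)
```

With this split, "User²" and "User2" would collide as account names while each still displays as its owner typed it. Whether that collision is desirable is the open question above.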
There is a PHP-libidn binding at http://php-idn.bayour.com/ but it looks like they do
not yet provide access to stringprep_utf8_nfkc_normalize().
The ucdata library might be interesting, it provides both composition and
decomposition, upper case, etc.
The download page at that site is broken, there is rev 2.5 available at
A further note: in addition to being decomposed, Safari actually is sending the
filename with **HTML character references**: "Wiki test é.png"
Adding an accept-charset attribute to the <form> unfortunately doesn't seem to
change anything. Also, in current 1.4 cvs the # is now stripped to - before we
get to the point where we normalize the title and would be interpreting the
character, so things get even weirder.
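As a rough illustration of the two steps involved (decode the character references, then normalise), here is a Python sketch. The input string is a hypothetical example of a decomposed name sent with a numeric character reference; the exact text Safari sends is not reproduced here, and this is not the fix that went into the wiki code.

```python
import html
import unicodedata

# Hypothetical input: "e" followed by a numeric reference to U+0301
# (COMBINING ACUTE ACCENT), i.e. a decomposed "é" sent as a reference.
received = "Wiki test e&#769;.png"

# Step 1: turn character references back into real characters.
decoded = html.unescape(received)

# Step 2: compose to NFC so the name matches what other browsers send.
normalized = unicodedata.normalize("NFC", decoded)
print(normalized)
```

The order matters: if "#" is replaced with "-" before the reference is decoded, step 1 no longer has anything to recognise.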
Now fixed in 1.4 CVS.
Might consider backporting the isolated filename normalization part to 1.3 on account of the Safari problem without
risking the general case; leaving this bug open for the moment.
1.4 nearing release; not backporting.