Image uploads from Mac OS X to a Wikipedia using UTF-8 result in the image not being found later. This appears to be independent of the browser used. (I'm not experiencing this bug myself, as I don't have a Mac; I'm just reporting something that has been discussed in the German Wikipedia: <http://de.wikipedia.org/w/wiki.phtml?title=Wikipedia_Diskussion:UTF8-Probleme#Umlaute_in_Upload_Dateinamen_bei_Mac_OS_X> (German).)

The reason for this problem seems to be that the Mac OS file system uses a different decomposition policy for file names than other operating systems or most browsers. To me it seems that the best solution (and The Right Thing) would be to perform Unicode canonicalisation (see <http://www.unicode.org/notes/tn5/>) on the server side, on the names of uploaded files, but also on search terms and article titles.

To clarify: in Unicode (and therefore in UTF-8) there are often several ways of expressing the same character. For instance, there is a separate character for "ü", but also a way to express it as "u" + combining dots. The two representations are (or should be) equivalent, but are not handled as such by the wiki software. It would be best to enforce a consistent internal canonical form by processing all incoming Unicode.

The following appeared on the mailing list unicode@unicode.org:

  FYI, by far the largest source of text in NFD (decomposed) form in Mac OS X is the file system. File names are stored this way (for historical reasons), so anything copied from a file name is in (a slightly altered form of) NFD. Also, a few keyboard layouts generate text that is partly decomposed, for ease of typing (e.g., Vietnamese).

  Deborah Goldsmith
  Internationalization, Unicode liaison
  Apple Computer, Inc.
  goldsmit@apple.com

This makes it quite clear that this is not a bug on the part of Mac OS: it's a classic incompatibility, which should be handled by the server.
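To make the two representations concrete, here is a minimal PHP sketch (the Normalizer class is from PHP's intl extension and is used here only as a stand-in for whatever normalization routine the wiki would adopt):

    <?php
    // Precomposed form (NFC): U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    $precomposed = "\xC3\xBC";      // "ü" as a single code point
    // Decomposed form (NFD): U+0075 "u" followed by U+0308 COMBINING DIAERESIS
    $decomposed  = "u\xCC\x88";     // what Mac OS X file names contain

    var_dump( $precomposed === $decomposed ); // false: the byte strings differ

    // After normalizing to the composed form, the two compare equal:
    var_dump( Normalizer::normalize( $decomposed, Normalizer::FORM_C ) === $precomposed ); // true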
The same thing happens on the French wiki at http://fr.wikipedia.org/wiki/Image:Lieutenant-colonel_des_armes_a%CC%80_cheval.png Note the "a%CC%80" (a decomposed "à").
I have dug up some more info on this: the crucial point is that *some* canonical form (normalization form) should be used as the internal representation. For compatibility reasons, this should probably be a composed form, as the decomposed forms are rendered badly on some systems. Here is the official document on Unicode normalization forms: <http://www.unicode.org/reports/tr15/> HTH
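A small sketch of what picking the composed form (NFC) as the internal representation would mean in practice; this also covers the "a%CC%80" case from the French wiki above (Normalizer is again from PHP's intl extension, used only for illustration):

    <?php
    // Sketch: convert all incoming text to NFC as the canonical internal form.
    function toInternalForm( $utf8Text ) {
        return Normalizer::normalize( $utf8Text, Normalizer::FORM_C );
    }

    // "à" in NFD ("a" + U+0300 COMBINING GRAVE, as a Mac file name delivers it)
    // becomes the single precomposed code point U+00E0:
    $nfd = "a\xCC\x80";
    var_dump( bin2hex( toInternalForm( $nfd ) ) ); // string "c3a0"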
This has to be done at least for:
* User names
* File names
* Page titles
and should also be done for wikitext, at least in the search index.
Bug seems to be specific to Safari. Firefox 0.9.1 and IE 5.2.3 both normalize the name to the precomposed form.

Uploaded from Safari: http://meta.wikimedia.org/wiki/Image:Wiki_test_e%CC%81.png
Uploaded from Firefox and IE: http://meta.wikimedia.org/wiki/Image:Wiki_test_%C3%A9.png

Nonetheless we certainly should be normalizing input... Check if there's an iconv or mb_* function for doing this efficiently.
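To see the difference in the two URLs above, a quick illustrative check in PHP:

    <?php
    // The same visible title, as encoded by the two browsers:
    $fromSafari  = rawurldecode( 'Wiki_test_e%CC%81.png' ); // "e" + U+0301 COMBINING ACUTE (NFD)
    $fromFirefox = rawurldecode( 'Wiki_test_%C3%A9.png' );  // precomposed U+00E9 "é" (NFC)

    var_dump( $fromSafari === $fromFirefox ); // false: different bytes, so different page titles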
I spent a few minutes googling and came up with nothing useful pre-existing in PHP. Guess I'll have to write another hack. :P All the necessary data should be in the Unicode data tables... It may be possible to write a DSO extension that makes use of existing library functions (libidn seems to have UTF-8-based normalization functions for instance) but we'll need a 'native' PHP version anyway for general distribution.
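A rough sketch of the shape such a thing might take. Nothing here is existing code: the function and class names are made up, and the normalizer_normalize() fast path is PHP's later intl extension, standing in for a hypothetical DSO wrapping libidn:

    <?php
    // Hypothetical wrapper: use a compiled extension when one is available,
    // otherwise fall back to a pure-PHP, table-driven implementation.
    function wfNormalizeToNFC( $utf8String ) {
        if ( function_exists( 'normalizer_normalize' ) ) {
            // Fast path via a native extension (here PHP's intl, purely as an example).
            return normalizer_normalize( $utf8String, Normalizer::FORM_C );
        }
        // Slow path: pure-PHP normalization built from the Unicode data tables
        // (decompose, reorder combining marks canonically, recompose).
        return PurePhpNormalizer::toNFC( $utf8String ); // hypothetical class
    }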
libidn provides a stringprep_utf8_nfkc_normalize() function. The glyphs created by this normalization differ from the input, e.g. ² becomes 2. When using this for user names, would we want to preserve the original string for display but use an internal representation for comparison? There is a PHP-libidn binding at http://php-idn.bayour.com/ but it looks like they do not yet provide access to stringprep_utf8_nfkc_normalize().
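A sketch of that display/comparison split, using PHP's intl Normalizer as a stand-in for libidn's stringprep_utf8_nfkc_normalize(); the policy shown is only a suggestion, not what the software currently does:

    <?php
    $name = "User\xC2\xB2";   // "User²" (U+00B2 SUPERSCRIPT TWO)

    $nfc  = Normalizer::normalize( $name, Normalizer::FORM_C );  // still "User²" – canonical only
    $nfkc = Normalizer::normalize( $name, Normalizer::FORM_KC ); // "User2" – compatibility folding

    // Possible policy: keep the NFC form for storage and display, but use the
    // NFKC form as the comparison key, so "User²" and "User2" cannot coexist.
    $displayName   = $nfc;
    $comparisonKey = $nfkc;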
The ucdata library might be interesting; it provides both composition and decomposition, upper case, etc.: http://crl.nmsu.edu/~mleisher/ucdata.html The download page at that site is broken; rev 2.5 is available at ftp://crl.nmsu.edu/CLR/multiling/unicode/ucdata-2.5.tar.gz
A further note: in addition to being decomposed, Safari is actually sending the filename with **HTML character references**: "Wiki test é.png"

Adding an accept-charset attribute to the <form> unfortunately doesn't seem to change anything. Also, in current 1.4 CVS the # is now stripped to - before we get to the point where we normalize the title and would be interpreting the character reference, so things get even weirder.
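If Safari really submits character references, something along these lines would be needed before normalization; the submitted string here is a hypothetical reconstruction, since the exact reference form is not visible in the comment above:

    <?php
    // Hypothetical: the filename arrives as a numeric character reference
    // ("&#769;" is U+0301 COMBINING ACUTE ACCENT) rather than raw UTF-8.
    $submitted = 'Wiki test e&#769;.png';

    // Decode the references first, then normalize; otherwise the "#" gets
    // mangled by the title-cleaning step mentioned above.
    $decoded = html_entity_decode( $submitted, ENT_QUOTES, 'UTF-8' );
    // $decoded is now "e" + U+0301, i.e. the decomposed form, which a
    // subsequent NFC pass would turn into the precomposed "é".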
Now fixed in 1.4 CVS. Might consider backporting the isolated filename normalization part to 1.3 on account of the Safari problem without risking the general case; leaving this bug open for the moment.
1.4 nearing release; not backporting.