Last modified: 2010-05-15 15:38:22 UTC
Sorry for mistakenly reporting this in #3738.

I'm using mediawiki-1.5.0 with MySQL 4.1.14, which is set to use UTF8 as the default charset. However, because most of the other, older web apps here use Latin1, the default connection charset is set to Latin1 (in my.cnf: init-connect = 'set names latin1'). I.e. the MySQL server stores everything in UTF8 but talks Latin1 to clients by default.

Because mediawiki is fully capable of using UTF8, it can/should talk UTF8 directly with the database, so I changed includes/Database.php to tell MySQL that mediawiki wants to talk UTF8:

--- mediawiki-1.5.0/includes/Database.php 2005-08-28 23:59:17.000000000 +0200
+++ wiki/includes/Database.php 2005-10-23 19:06:24.000000000 +0200
@@ -223,6 +223,7 @@
 # may cause some operations to fail possibly.
 @/**/$this->mConn = mysql_connect( $server, $user, $password );
 }
+ $this->query('SET NAMES utf8');
 }
 if ( $dbName != '' ) {

This works fine, but should probably have an additional check like: if mysql_version >= 4.1 then send "set names utf8".

mediawiki-1.5.0 was also able to display everything correctly while talking Latin1 with our database, but I suppose in that case mediawiki did the translation from latin1 to utf8 (because pages were displayed in utf8). I suppose letting the database do the translation scales better than changing encodings in PHP.

Also, there was a minor glitch with pages that have German umlauts in their names. The page links were displayed in red (not existing) even though the pages existed, and clicking such a link brought up the edit page with the textbox filled with the content of the existing page.
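The version check suggested in the report could look roughly like the sketch below. It is written in Python for illustration rather than in MediaWiki's PHP, and the helper names `parse_mysql_version` and `connection_setup_statements` are hypothetical, not part of MediaWiki. It assumes the server reports a version string of the form "4.1.14" or "4.1.14-log" and that `SET NAMES` is only sent to servers at 4.1 or later.

```python
import re


def parse_mysql_version(version_string):
    """Return (major, minor) from a version string such as '4.1.14-log'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def connection_setup_statements(version_string):
    """Hypothetical helper: statements to send right after connecting.

    Servers older than 4.1 get no charset statement at all.
    """
    if parse_mysql_version(version_string) >= (4, 1):
        return ["SET NAMES utf8"]
    return []


if __name__ == "__main__":
    for version in ("4.0.26-log", "4.1.14", "5.0.15-standard"):
        print(version, connection_setup_statements(version))
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings lexically, where "4.10" would sort before "4.9".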
MySQL 4.1 and 5.0 have insufficient UTF-8 support, as discussed on the mailing list: http://mail.wikipedia.org/pipermail/wikitech-l/2005-October/thread.html#31960
I read the entire thread and understand that the page content is not affected by any translation, because it's stored in a blob column. But in my example above the database connection charset defaults to latin1 (which is the default configuration for Gentoo Linux and probably others), which means that MySQL translates the query results from the server charset (utf8 in my case) to latin1. This results in a lot of data loss (not everything can be translated to latin1). Telling MySQL to talk utf8 turns off the translation to latin1, and page names were displayed correctly again in my case.

What I mean is that mediawiki maybe shouldn't rely on the default connection charset setup but should explicitly tell MySQL not to translate results. If "set names utf8" is problematic, maybe turning off conversion completely ("set character_set_results=NULL") could be an option?
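To make the data-loss point concrete, here is a small Python sketch of what converting a result to latin1 does to a string. It uses Python's own codecs, not MySQL, so it only approximates the server's conversion, and the function names are made up for the illustration. The assumption is that characters with no latin1 equivalent are replaced by "?", and that the client then expects UTF-8 bytes.

```python
def to_latin1_connection(text):
    """Simulate a latin1 connection: unrepresentable characters become '?'."""
    return text.encode("latin-1", errors="replace")


def to_utf8_connection(text):
    """Simulate a utf8 connection: every character survives."""
    return text.encode("utf-8")


def is_valid_utf8(raw):
    """Check whether a client that expects UTF-8 could decode these bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # 'a' with umlaut (U+00E4) exists in latin1; the euro sign (U+20AC) does not.
    title = "K\u00e4se \u20ac"
    latin1_bytes = to_latin1_connection(title)
    utf8_bytes = to_utf8_connection(title)
    print(latin1_bytes, is_valid_utf8(latin1_bytes))
    print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

In this sketch the euro sign is lost outright, and the umlaut arrives as a single latin1 byte that is not valid UTF-8, so a client that assumes UTF-8 cannot interpret the title as stored.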
Experimental support for 'set names utf8' and an explicitly defined utf8 charset on tables is in REL1_5 and HEAD. (Will be in the 1.5.1 release.) While still insufficient for our use on Wikipedia because of the breakage on page titles, usernames, comments, etc., you might give this a try.
Forgot to close this. Future backend changes can improve things, but it'll still use the utf8 connection charset. :D