Last modified: 2010-05-15 15:38:22 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T5786, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 3786 - Use utf8 connection charset for mysql >= 4.1
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Database
Version: 1.5.x
Hardware: PC Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch
Depends on:
Blocks: 3738
Reported: 2005-10-23 23:05 UTC by Andreas Neuhaus
Modified: 2010-05-15 15:38 UTC (History)

Description Andreas Neuhaus 2005-10-23 23:05:35 UTC
Sorry for mistakenly reporting this in #3738

I'm using mediawiki-1.5.0 with MySQL 4.1.14, which is set to use UTF8 as the
default charset. However, because most other old web apps here use Latin1, the
default connection charset is set to Latin1 (in my.cnf: init-connect = 'set
names latin1'). I.e. the MySQL server keeps everything in UTF8, but talks Latin1
by default to the clients. Because mediawiki is fully capable of using UTF8, it
can/should talk UTF8 directly with the database, so I changed
includes/Database.php to tell MySQL that mediawiki wants to talk UTF8:

--- mediawiki-1.5.0/includes/Database.php 2005-08-28 23:59:17.000000000 +0200
+++ wiki/includes/Database.php 2005-10-23 19:06:24.000000000 +0200
@@ -223,6 +223,7 @@
             # may cause some operations to fail possibly.
             @/**/$this->mConn = mysql_connect( $server, $user, $password );
         }
+        $this->query('SET NAMES utf8');
     }

     if ( $dbName != '' ) {

This works fine, but should probably have an additional check like: if
mysql_version >= 4.1 then send "set names utf8".
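
The version check suggested above could look roughly like the following. This is a minimal sketch in Python rather than MediaWiki's PHP, and it is not taken from MediaWiki's code: the helper names `parse_mysql_version` and `connection_setup_statements` are hypothetical, and the version string is assumed to look like the value the server reports (for example "4.1.14" or "4.0.26-log").

```python
import re


def parse_mysql_version(version_string):
    """Extract (major, minor, patch) from a string such as '4.1.14-log'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognised MySQL version: {version_string!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '4.1') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def connection_setup_statements(version_string):
    """Return the statements to send right after connecting.

    Servers older than 4.1 have no per-connection charset setting,
    so nothing is sent to them.
    """
    if parse_mysql_version(version_string) >= (4, 1, 0):
        return ["SET NAMES utf8"]
    return []


if __name__ == "__main__":
    for version in ("4.0.26-log", "4.1.14", "5.0.15"):
        print(version, connection_setup_statements(version))
```

Comparing parsed integer tuples rather than raw strings avoids mis-ordering versions such as "4.10" and "4.9".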

mediawiki-1.5.0 was also able to display everything correctly while talking
Latin1 with our database, but I suppose in that case mediawiki did the
translation from latin1 to utf8 (because pages were displayed in utf8). I
suppose letting the database do the translation scales better than changing
encodings in PHP. There was also a minor glitch with pages that have German
umlauts in their names: the page links were displayed in red (as not existing)
even if the pages existed, and clicking such a link brought up the edit page
with the text box filled with the content of the existing page.
Comment 1 Brion Vibber 2005-10-23 23:11:01 UTC
MySQL 4.1 and 5.0 have insufficient UTF-8 support as discussed on the mailing list:
http://mail.wikipedia.org/pipermail/wikitech-l/2005-October/thread.html#31960
Comment 2 Andreas Neuhaus 2005-10-23 23:44:57 UTC
I read the entire thread and understand that the page content is not affected by
any translation, because it's stored in a blob column.
But in my example above the database connection charset defaults to latin1
(which is the default configuration for Gentoo Linux and probably others), which
means that MySQL translates the query results from server-charset (utf8 in my
case) to latin1. This results in a lot of data loss (not everything can be
translated to latin1). Telling MySQL to talk utf8 turns off the translation to
latin1 and page names were displayed correctly again in my case.
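
The data loss described above can be illustrated without a database. The sketch below uses Python's own codecs as a stand-in for the server-side conversion: it is an approximation, not MySQL's actual code path, and the function name `to_latin1_lossy` and the sample titles are made up for the example. Characters that exist in Latin1 (such as German umlauts) survive, while anything else is replaced by a question mark and cannot be recovered.

```python
def to_latin1_lossy(text):
    """Simulate converting a result string to latin1 and reading it back.

    Characters with no latin1 equivalent are replaced by '?', which
    approximates what a utf8 -> latin1 result conversion does to them.
    """
    latin1_bytes = text.encode("latin-1", errors="replace")
    return latin1_bytes.decode("latin-1")


if __name__ == "__main__":
    for title in ("Müller", "日本語", "Łódź"):
        converted = to_latin1_lossy(title)
        status = "intact" if converted == title else "damaged"
        print(f"{title!r} -> {converted!r} ({status})")
```

Once a character has been turned into a question mark, the client has no way to tell which character it was, so the fix has to happen before the conversion, by choosing the connection charset.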

What I mean is that mediawiki maybe shouldn't rely on the default connection
charset setup but explicitly tell MySQL not to translate results. If "set names
utf8" is problematic, maybe turning off conversion completely ("set
character_set_results=NULL") could be an option?
Comment 3 Brion Vibber 2005-10-26 02:00:16 UTC
Experimental support for 'set names utf8' and explicitly-defined utf8 charset on tables 
is in REL1_5 and HEAD. (Will be in 1.5.1 release.)

While still insufficient for our use on Wikipedia because of the breakage on page 
titles, usernames, comments, etc, you might give this a try.
Comment 4 Brion Vibber 2005-10-28 08:33:47 UTC
Forgot to close this. Future backend changes can improve things, but it'll still use 
the utf8 connection charset. :D
