Last modified: 2010-05-15 15:50:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You can still run searches in Bugzilla or access your list of votes, but bug reports in Bugzilla will obviously not be up-to-date.
Bug 11608 - should mediawiki_mysql2postgres.pl use "--compatible=postgres"?
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Database (Other open bugs)
Version: 1.10.x
Hardware: All
OS: All
Priority: Normal
Severity: normal (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2007-10-08 23:37 UTC by Robert Fleming
Modified: 2010-05-15 15:50 UTC (History)
CC: 1 user

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Attachments
Description Robert Fleming 2007-10-08 23:37:33 UTC
Should maintenance/postgres/mediawiki_mysql2postgres.pl use "--compatible=postgres" for mysqldump?

I was migrating from MySQL to PostgreSQL.  My MySQL database (a stock Debian install) contains UTF-8-encoded text, although every column of every table I checked had its "Character Set" set to "cp1252"; I don't know why, and I don't know what that setting affects, but my MediaWiki installation had no problem pulling the UTF-8 data from MySQL and presenting it properly to the web browser.

mysqldump, using the options specified in mediawiki_mysql2postgres.pl, seems to take the UTF-8-encoded data, treat it as latin1 (or cp1252?), and re-encode that text as UTF-8.  So when PostgreSQL loads that file into a UTF8-encoded database, it decodes the UTF-8 once and stores the resulting UTF-8 byte stream more or less literally.  The result is that the web page displays that byte stream as literal characters instead of the original text.

Sorry, it's hard to describe this comprehensibly yet concisely.  Basically, the already-UTF-8 text is needlessly getting UTF-8-encoded a second time by mysqldump, and the --compatible=postgres option stops it from doing that.  It might have to do with my original MySQL database thinking its columns were cp1252, but it was MediaWiki that originally created that schema (ca. Mar. 2004), so I'm unlikely to be the only one to have this problem.  I'm guessing it wouldn't affect wikis that contain only ASCII text.
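The double-encoding described above can be reproduced in a few lines.  This is only an illustrative sketch (the sample string is hypothetical, and latin-1 stands in for cp1252, which agrees with it for the bytes involved here):

```python
# Reproduce the double-encoding: UTF-8 bytes mislabeled as
# latin1/cp1252 get re-encoded to UTF-8 a second time by the dump.
original = "café"                      # hypothetical sample text

utf8_bytes = original.encode("utf-8")  # what MySQL actually stored
# mysqldump believes the column is latin1, decodes it accordingly,
# then emits UTF-8 -- encoding the already-encoded bytes again:
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")

# PostgreSQL decodes the dump file once, so the stored text is the
# first UTF-8 encoding rendered as literal characters (mojibake):
stored = double_encoded.decode("utf-8")
print(stored)  # "cafÃ©" instead of "café"
```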

Sorry if this is a duplicate bug.  I was not able to find any existing bugs about mediawiki_mysql2postgres.pl.

FWIW, mediawiki_mysql2postgres.pl in MW 1.11.0 still does not use "--compatible=postgres".
Comment 1 Greg Sabino Mullane 2007-10-09 14:16:09 UTC
Odd: mysqldump should use utf-8 by default. Historically, using compatible=postgres has caused other problems to pop up, though I don't recall now what they were. This script has not been maintained lately: you might want to simply do an XML dump of your database and export/import that way instead.
Comment 2 Robert Fleming 2007-10-10 00:55:38 UTC
I'm up and running with PostgreSQL now, having successfully used the compatible=postgres option, so I personally won't need to go the XML route.

Indications are that mysqldump does use UTF-8 by default.  My best guess is that my root problem was that the MySQL server thought the data it contained was cp1252, and communicated that fact to the mysqldump client.  So mysqldump dutifully transcoded the "cp1252" text into UTF-8.  The only part I don't understand is why the "compatible=postgres" option would inhibit this behavior.

I did at one point attempt to change my MySQL columns from cp1252 to UTF-8 (via mysql-admin), but I think MySQL unfortunately interpreted that as a request to transcode the column's data -- of course I was only trying to inform it that the data was not in the encoding it thought it was.  I probably could have fixed this by dumping the DB, tweaking the dump file, and reloading it.

Anyway, I'm not inclined to pursue or reproduce the issue -- I'd be happy to leave MySQL behind and not look back :) (if only because I know PostgreSQL a lot better).  So this bug can be resolved INVALID or WORKSFORME or whatever.  In any case it's good to have all of this on record in case anyone else runs into it.  Thanks.
Comment 3 Greg Sabino Mullane 2007-10-10 01:11:10 UTC
Thanks, I'll make a note at the top of the script and close the bug.
