Last modified: 2010-05-15 15:50:40 UTC
Should maintenance/postgres/mediawiki_mysql2postgres.pl use "--compatible=postgres" for mysqldump?
I was migrating from MySQL to PostgreSQL. My MySQL database (a stock Debian install) contains UTF-8-encoded text, although every column of every table I checked was set to the character set "cp1252"; I don't know why, or what that setting affects, but my MediaWiki installation had no problem pulling the UTF-8 data from MySQL and presenting it correctly to the web browser.
mysqldump, invoked with the options specified in mediawiki_mysql2postgres.pl, seems to take the UTF-8-encoded data, treat it as latin1 (or cp1252?), and re-encode it as UTF-8. So when PostgreSQL loads that file (into a UTF8-mode database), it decodes the UTF-8 once and stores the result, which is itself still a UTF-8 byte stream, more or less literally. The result is that the web page incorrectly displays that byte stream as individual characters.
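To make the suspected mechanism concrete, here is a small sketch in Python (used purely as an illustration; it is not part of mediawiki_mysql2postgres.pl or the migration):

```python
# Sketch of the suspected double-encoding during the dump.
text = "résumé"                       # UTF-8 text as served by MediaWiki
utf8_bytes = text.encode("utf-8")     # the raw bytes sitting in MySQL

# mysqldump believes the column is cp1252, so it "decodes" those bytes
# as cp1252 and re-encodes the result as UTF-8 for the dump file:
dumped = utf8_bytes.decode("cp1252").encode("utf-8")

# PostgreSQL decodes the dump exactly once, leaving the original UTF-8
# byte stream visible as literal characters on the page:
print(dumped.decode("utf-8"))         # rÃ©sumÃ© instead of résumé
```

This matches the symptom described: each multibyte UTF-8 character comes out as two (or more) garbage characters.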
Sorry, it's hard to describe this comprehensibly yet concisely. Basically, the UTF-8 text is needlessly getting UTF-8-encoded a second time by mysqldump, and the --compatible=postgres option stops it from doing that. It may have to do with my original MySQL database thinking its columns were cp1252, but it was MediaWiki that originally created that schema (ca. Mar. 2004), so I'm unlikely to be the only one to have this problem. I'm guessing it wouldn't affect wikis that contain only ASCII text.
Sorry if this is a duplicate bug. I was not able to find any existing bugs about mediawiki_mysql2postgres.pl.
FWIW, mediawiki_mysql2postgres.pl in MW 1.11.0 still does not use "--compatible=postgres".
Odd: mysqldump should use UTF-8 by default. Historically, using compatible=postgres has caused other problems to pop up, though I don't recall what they were now. This script has not been maintained lately; you might want to simply do an XML dump of your database and export/import that way instead.
I'm up and running with PostgreSQL now, having successfully used the compatible=postgres option, so I personally won't need to go the XML route.
Indications are that mysqldump does use UTF-8 by default. My best guess is that my root problem was that the MySQL server thought the data it contained was cp1252, and communicated that fact to the mysqldump client. So mysqldump dutifully transcoded the "cp1252" text into UTF-8. The only part I don't understand is why the "compatible=postgres" option would inhibit this behavior.
I did at one point attempt to change my MySQL columns from cp1252 to UTF-8 (via mysql-admin), but I think MySQL unfortunately understood that I was asking it to transcode the column -- of course I was only trying to inform it that the data it contained was not in the encoding it thought it was. I probably could have fixed this by dumping the DB, tweaking the dump file, and reloading it.
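The "dump, tweak, reload" repair mentioned above can be sketched as reversing the accidental round trip on the dumped text. Again a Python illustration only (the function name is made up, and characters that fall outside cp1252 would need error handling in practice):

```python
# Sketch of repairing double-encoded ("mojibake") text from a dump file.
def undo_double_encoding(mojibake: str) -> str:
    """Re-encode the visible characters as cp1252 to recover the
    original UTF-8 byte stream, then decode that stream properly."""
    return mojibake.encode("cp1252").decode("utf-8")

print(undo_double_encoding("rÃ©sumÃ©"))  # résumé
```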
Anyway, I'm not inclined to pursue or reproduce the issue -- I'd be happy to leave MySQL behind and not look back :) (if only because I know PostgreSQL a lot better). So this bug can be resolved INVALID or WORKSFORME or whatever. In any case, it's good to have all of this on record in case anyone else runs into it. Thanks.
Thanks, I'll make a note at the top of the script and close the bug.