Last modified: 2010-05-15 15:50:40 UTC
Should maintenance/postgres/mediawiki_mysql2postgres.pl use "--compatible=postgres" for mysqldump?
I was migrating from MySQL to PostgreSQL. My MySQL database (a stock Debian install) contains UTF-8-encoded text, although every column of every table I checked was set to the character set "cp1252"; I don't know why, or what that setting affects, but my MediaWiki installation had no problem pulling the UTF-8 data from MySQL and presenting it correctly to the web browser.
mysqldump, invoked with the options specified in mediawiki_mysql2postgres.pl, seems to take the UTF-8-encoded data, treat it as latin1 (or cp1252?), and re-encode it as UTF-8. So when PostgreSQL loads that file (into a UTF8-mode database), it decodes the UTF-8 once and stores the result, which is itself still a UTF-8 byte stream, more or less literally. The result is that the web page incorrectly displays that byte stream as individual characters.
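To make the suspected mechanism concrete, here is a small sketch in Python (used purely as an illustration; it is not part of mediawiki_mysql2postgres.pl or the migration):

```python
# Sketch of the suspected double-encoding during the dump.
text = "résumé"                       # UTF-8 text as served by MediaWiki
utf8_bytes = text.encode("utf-8")     # the raw bytes sitting in MySQL

# mysqldump believes the column is cp1252, so it "decodes" those bytes
# as cp1252 and re-encodes the result as UTF-8 for the dump file:
dumped = utf8_bytes.decode("cp1252").encode("utf-8")

# PostgreSQL decodes the dump exactly once, leaving the original UTF-8
# byte stream visible as literal characters on the page:
print(dumped.decode("utf-8"))         # rÃ©sumÃ© instead of résumé
```

This matches the symptom described: each multibyte UTF-8 character comes out as two (or more) garbage characters.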
Sorry, it's hard to describe this comprehensibly yet concisely. Basically, the UTF-8 text is needlessly getting UTF-8-encoded a second time by mysqldump, and the --compatible=postgres option stops it from doing that. It may have to do with my original MySQL database thinking its columns were cp1252, but it was MediaWiki that originally created that schema (ca. Mar. 2004), so I'm unlikely to be the only one to have this problem. I'm guessing it wouldn't affect wikis that contain only ASCII text.
Sorry if this is a duplicate bug. I was not able to find any existing bugs about mediawiki_mysql2postgres.pl.
FWIW, mediawiki_mysql2postgres.pl in MW 1.11.0 still does not use "--compatible=postgres".
Odd: mysqldump should use UTF-8 by default. Historically, using compatible=postgres has caused other problems to pop up, though I don't recall what they were now. This script has not been maintained lately; you might want to simply do an XML dump of your database and export/import that way instead.
I'm up and running with PostgreSQL now, having successfully used the compatible=postgres option, so I personally won't need to go the XML route.
Indications are that mysqldump does use UTF-8 by default. My best guess is that my root problem was that the MySQL server thought the data it contained was cp1252, and communicated that fact to the mysqldump client. So mysqldump dutifully transcoded the "cp1252" text into UTF-8. The only part I don't understand is why the "compatible=postgres" option would inhibit this behavior.
I did at one point attempt to change my MySQL columns from cp1252 to UTF-8 (via mysql-admin), but I think MySQL unfortunately understood that I was asking it to transcode the column -- of course I was only trying to inform it that the data it contained was not in the encoding it thought it was. I probably could have fixed this by dumping the DB, tweaking the dump file, and reloading it.
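The "dump, tweak, reload" repair mentioned above can be sketched as reversing the accidental round trip on the dumped text. Again a Python illustration only (the function name is made up, and characters that fall outside cp1252 would need error handling in practice):

```python
# Sketch of repairing double-encoded ("mojibake") text from a dump file.
def undo_double_encoding(mojibake: str) -> str:
    """Re-encode the visible characters as cp1252 to recover the
    original UTF-8 byte stream, then decode that stream properly."""
    return mojibake.encode("cp1252").decode("utf-8")

print(undo_double_encoding("rÃ©sumÃ©"))  # résumé
```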
Anyway, I'm not inclined to pursue or reproduce the issue -- I'd be happy to leave MySQL behind and not look back :) (if only because I know PostgreSQL a lot better). So this bug can be resolved INVALID or WORKSFORME or whatever. In any case, it's good to have all of this on record in case anyone else runs into it. Thanks.
Thanks, I'll make a note at the top of the script and close the bug.