Last modified: 2013-06-18 13:26:46 UTC
C:\dumper>java -client -classpath mwdumper.jar;mysql-connector-java-3.1.14/mysql-connector-java-3.1.14-bin.jar org.mediawiki.dumper.Dumper "--output=mysql://127.0.0.1/wikiar?user=usr&password=pass" "--format=sql:1.5" "D:\arwiki
1,000 pages (25.65/sec), 1,000 revs (25.65/sec)
2,000 pages (20.713/sec), 2,000 revs (20.713/sec)
3,000 pages (24.385/sec), 3,000 revs (24.385/sec)
4,000 pages (24.352/sec), 4,000 revs (24.352/sec)
5,000 pages (25.293/sec), 5,000 revs (25.293/sec)
Exception in thread "main" java.io.IOException: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'rev_comment' at row 809
at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
at org.mediawiki.dumper.Dumper.main(Unknown Source)
java version "1.6.0_04"
Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
Can you double-check that the proper encoding's being used?
The most compatible case is probably to use the binary schema. You may or may not have troubles with other modes.
I have tried all three:
# Backwards-compatible UTF-8
# Experimental MySQL 4.1/5.0 UTF-8
# Experimental MySQL 4.1/5.0 binary
But I get the exact same error each time (I drop the db, then reinstall mw). Does mwdumper have some encoding schema setting I should change?
I found that the error isn't from mwdumper but from the data dumps. The problem is that the dump contains more data than the column type can hold. When I changed rev_comment from tinyblob to blob, it imported without errors. Should this be changed in MediaWiki, or what?
As a workaround until the dump is fixed, mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary. I implemented this fix and checked it in at http://dbpedia.svn.sourceforge.net/viewvc/dbpedia?view=rev&revision=1771 . Seems to fix that problem for me. Feel free to copy that code back to mediawiki if you want.
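A minimal sketch of that kind of truncation, assuming the 255-byte limit of a TINYBLOB rev_comment column. This is not the code from the revision linked above; the names CommentTruncator and truncateUtf8 are hypothetical. It cuts only between code points, so the result stays valid UTF-8 and never exceeds the limit, and it uses only calls available since Java 1.5:

```java
// Hypothetical helper for illustration; not part of mwdumper.
final class CommentTruncator {
    // Assumed limit: a TINYBLOB column holds at most 255 bytes.
    static final int MAX_COMMENT_BYTES = 255;

    // Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes.
    // The cut always falls between code points, so no character is split.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 length of this code point: 1, 2, 3 or 4 bytes.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }
}
```

A caller would apply something like CommentTruncator.truncateUtf8(comment, CommentTruncator.MAX_COMMENT_BYTES) before writing the comment, whether to JDBC or to raw SQL output.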
Ahhhh ok I think I see the base issue -- if a 2-byte or 3-byte char is cut off at the 255-byte boundary when stored, it becomes an invalid char. The XML dump outputter runs UTF-8 validation and turns the bad char into a valid U+FFFD ... which is 3 bytes of UTF-8, over the 255-byte limit again.
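The effect can be reproduced in a few lines. This is an illustrative sketch only: TruncationDemo and its methods are hypothetical names, and Java's String decoder stands in for the dump outputter's UTF-8 validation, which may differ in detail. It needs Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of MediaWiki or mwdumper.
final class TruncationDemo {
    // Cuts the UTF-8 bytes of s at maxBytes (as a byte-limited column would),
    // then decodes them again. Java's decoder replaces malformed input
    // with U+FFFD, which stands in for the validation step here.
    static String cutAndRevalidate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] cut = Arrays.copyOf(utf8, Math.min(maxBytes, utf8.length));
        return new String(cut, StandardCharsets.UTF_8);
    }

    // Number of bytes s occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For a comment of 254 ASCII characters followed by one 2-byte character, the cut at 255 bytes leaves a lone lead byte. After revalidation that byte becomes U+FFFD, so the comment re-encodes to 254 + 3 = 257 bytes, which no longer fits.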
Yeah, this should be fixed in our DB and MediaWiki should be smarter about truncation, but in the meantime it should be easy to make mwdumper smarter for this too.
Created attachment 7263
truncate comment at 255 bytes
It also works when you append
to the --output parameter.
But I have also added a patch to truncate the comment, based on the implementation by Christopher Sahnwaldt (comment 4).
Minor gripe: the patch uses String.isEmpty(), which was only added in JDK 1.6. Maybe use String.length() == 0 instead, so MWDumper still compiles under 1.5.
This doesn't suddenly become a blocker after 3 years of existence... :)
(In reply to comment #9)
Why isn't it? mwdumper can't be used to import dumps because of this bug.
See comment 6 for a workaround
(In reply to comment #11)
Unfortunately it works only for the jdbc connector, and it's not a solution for the sql output, is it?
(In reply to comment #12)
> (In reply to comment #11)
> Unfortunately it works only for the jdbc connector, and it's not a solution for
> the sql output, is it?
Yes, that is true. For the raw sql this is not a solution.
(In reply to comment #11)
> See comment 6 for a workaround
Didn't work for me. Still gives Data too long for column 'rev_comment'.
Gerrit change Ic078f6ee.
Change Ic078f6ee is merged now.