Last modified: 2014-02-12 23:40:00 UTC
I downloaded the latest version of the Spanish articles in XML (eswiki-20080507-pages-articles.xml.bz2) and the latest version of mwdumper (2008-04-13), and I followed the instructions to load the dump into a MySQL database. The exact line I type is:
    java -client -classpath mwdumper.jar;mysql-connector-java-3.1.12-bin.jar org.mediawiki.dumper.Dumper "--output=mysql://127.0.0.1/wikidb?user=<user>&password=<password>" "--format=sql:1.5" "C:\eswiki-20080507-pages-articles.xml.bz2"
(where <user> and <password> are correctly specified). Everything seems to work at first; the output I get is:
    1.000 pages (249,004/sec), 1.000 revs (249,004/sec)
followed by similar lines starting with 2.000, 3.000... until it reaches the line starting with 17.000. At that point I get the following message:
    17.000 pages (366,08/sec), 17.000 revs (366,08/sec)
    Exception in thread "main" java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2
(and then the usual exception stack trace). I think it could be related to the encoding of Spanish accents (á, é...) or special characters such as 'ñ', so I tried creating the database with other charsets, but I get the same error.
Please provide the stack trace.
Offhand it's likely an encoding issue; a possibly-default "Latin-1" schema will cause failure with this direct connection as the data will be converted from UTF-8 and titles will start to conflict when non-Latin-1 chars come in. A "UTF-8" schema may similarly cause failures when a title with a non-BMP character in it comes along, as MySQL's UTF-8 charset support is incomplete. If using the binary schema, things _should_ work.
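To illustrate the failure mode described above, here is a minimal sketch in Python (not part of mwdumper). It assumes that the lossy UTF-8 to Latin-1 conversion replaces unmappable characters with '?', and that the '0-?' in the error message is the namespace and title joined by '-'. The sample titles and the helper names to_latin1 and unique_key are hypothetical; the report does not say which titles actually collided.

```python
# Sketch only: mimics a lossy UTF-8 -> Latin-1 conversion of page titles.
# The titles below are made-up examples, not taken from the eswiki dump.

def to_latin1(title: str) -> str:
    """Convert to Latin-1, replacing unmappable characters with '?'."""
    return title.encode("latin-1", errors="replace").decode("latin-1")

def unique_key(namespace: int, title: str) -> str:
    """Assumed shape of the (namespace, title) unique key as shown in the error."""
    return f"{namespace}-{to_latin1(title)}"

def utf8_len(ch: str) -> int:
    """Number of bytes a string needs in UTF-8."""
    return len(ch.encode("utf-8"))

# Spanish accents and 'ñ' exist in Latin-1, so they survive the conversion.
spanish = to_latin1("España")

# Single-character titles outside Latin-1 all collapse to the same key.
keys = [unique_key(0, t) for t in ["Ω", "Ж", "中"]]

# A non-BMP character needs 4 bytes in UTF-8, more than the 3 bytes per
# character that MySQL's legacy "utf8" charset can store.
clef_bytes = utf8_len("\U0001D11E")

if __name__ == "__main__":
    print(spanish)
    print(keys)
    print(clef_bytes)
```

Under these assumptions, accented Spanish titles are not the ones that collide; the duplicates come from titles made of characters that Latin-1 cannot represent at all.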
Thank you very much for your help, but it still doesn't work... The stack trace is:
    17.000 pages (87,04/sec), 17.000 revs (87,04/sec)
    Exception in thread "main" java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
    Caused by: org.xml.sax.SAXException: java.sql.SQLException: Duplicate entry '0-?' for key 2
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        ... 2 more
    Caused by: java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2
        at org.mediawiki.importer.SqlServerStream.writeStatement(Unknown Source)
        at org.mediawiki.importer.SqlWriter.flushInsertBuffers(Unknown Source)
        at org.mediawiki.importer.SqlWriter.checkpoint(Unknown Source)
        at org.mediawiki.importer.SqlWriter15.updatePage(Unknown Source)
        at org.mediawiki.importer.SqlWriter15.writeEndPage(Unknown Source)
        at org.mediawiki.importer.MultiWriter.writeEndPage(Unknown Source)
        at org.mediawiki.importer.PageFilter.writeEndPage(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.closePage(Unknown Source)
        ... 14 more
Sorry for my inexperience, but, Brion, what do you mean by a "binary schema"? I have 4 parameters which could be "binary":
- MySQL connection collation (can be set from phpMyAdmin)
- Database collation (set while creating the 'wikidb' database)
- MySQL charset (it is set to UTF-8 Unicode, but I can't change it from phpMyAdmin. Should I change it? How can I change it?)
- Database character set (I can set it on the MediaWiki configuration page, with the options "Backwards-compatible UTF-8", "Experimental MySQL 4.1/5.0 UTF-8" or "Experimental MySQL 4.1/5.0 binary")
I tried many combinations of these parameters, but the problem persists. Could you help me, please? Thank you very much.
*** Bug 14958 has been marked as a duplicate of this bug. ***
Try to set the default-character-set in the my.ini or my.cnf (mysql\bin) of mysql to default-character-set="utf8" and restart the server.
(In reply to comment #5)
> Try to set the default-character-set in the my.ini or my.cnf (mysql\bin) of
> mysql to
> default-character-set="utf8"
> and restart the server.
You can also append &characterEncoding=UTF-8 to the --output parameter.
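As a rough sketch of that suggestion, the following Python snippet builds the resulting --output value. The helper name with_param is hypothetical and only does string handling; it does not talk to MySQL or mwdumper, and the base URL is the one from the original report with its <user> and <password> placeholders left as-is.

```python
def with_param(output_url: str, key: str, value: str) -> str:
    """Append key=value to a URL-style string, using '?' or '&' as needed."""
    separator = "&" if "?" in output_url else "?"
    return f"{output_url}{separator}{key}={value}"

# Base value taken from the command line in the original report.
base = "mysql://127.0.0.1/wikidb?user=<user>&password=<password>"
output_arg = "--output=" + with_param(base, "characterEncoding", "UTF-8")

if __name__ == "__main__":
    print(output_arg)
```

Because the value contains '&', it needs to stay inside double quotes on the command line, as in the original invocation.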
I've hit a similar encoding bug while importing enwiki. I was piping to SQL using this command line:
    bunzip2 -c enwiki-20120104-pages-articles.xml.bz2 | mwdumper --format=sql:1.5 > out.sql
The output was:
    Exception in thread "main" java.io.IOException: not a name start character: "U+26"
        at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
        at org.mediawiki.dumper.Dumper.main(mwdumper)
    Caused by: org.xml.sax.SAXParseException: not a name start character: "U+26"
        at gnu.xml.stream.SAXParser.parse(libgcj.so.10)
        at javax.xml.parsers.SAXParser.parse(libgcj.so.10)
        at javax.xml.parsers.SAXParser.parse(libgcj.so.10)
        at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
        ...1 more
    Caused by: javax.xml.stream.XMLStreamException: not a name start character: "U+26"
        at gnu.xml.stream.XMLParser.error(libgcj.so.10)
        at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.10)
        at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.10)
        at gnu.xml.stream.XMLParser.readCharData(libgcj.so.10)
        at gnu.xml.stream.XMLParser.next(libgcj.so.10)
        at gnu.xml.stream.SAXParser.parse(libgcj.so.10)
        ...4 more