Last modified: 2014-02-12 23:40:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T16379, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 14379 - mwdumper crashes on non-latin input characters


Summary:	mwdumper crashes on non-latin input characters

Status:	NEW

Product:	Utilities
Classification:	Unclassified
Component:	mwdumper
Version:	unspecified
Hardware:	PC Windows XP

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Brion Vibber

URL:
Whiteboard:
Keywords:

Duplicates:	14958
Depends on:
Blocks:

Reported:	2008-06-02 11:15 UTC by Jesus
Modified:	2014-02-12 23:40 UTC
CC List:	1 user

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Jesus 2008-06-02 11:15:02 UTC

I downloaded the latest dump of the Spanish articles in XML and the latest version of mwdumper (2008-04-13):

eswiki-20080507-pages-articles.xml.bz2

and followed the instructions to load it into a MySQL database. The exact command I typed is:

java -client -classpath mwdumper.jar;mysql-connector-java-3.1.12-bin.jar org.mediawiki.dumper.Dumper "--output=mysql://127.0.0.1/wikidb?user=<user>&password=<password>" "--format=sql:1.5" "C:\eswiki-20080507-pages-articles.xml.bz2"

(where <user> and <password> are correctly specified).

Everything seems to work at first; the output I get is:

1.000 pages (249,004/sec), 1.000 revs (249,004/sec)

and similar lines starting with 2.000, 3.000... until it reaches the line starting with 17.000. At that point I get the following message:

17.000 pages (366,08/sec), 17.000 revs (366,08/sec)
Exception in thread "main" java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2

(and then the typical exception stack trace).


I think it may be related to the encoding of Spanish accents (á, é, ...) or special characters such as 'ñ', so I tried creating the database with other charsets, but I get the same error.

Comment 1 MER-C 2008-06-04 11:08:58 UTC

Please provide the stack trace.

Comment 2 Brion Vibber 2008-06-04 17:18:20 UTC

Offhand it's likely an encoding issue; a possibly-default "Latin-1" schema will cause failure with this direct connection as the data will be converted from UTF-8 and titles will start to conflict when non-Latin-1 chars come in. A "UTF-8" schema may similarly cause failures when a title with a non-BMP character in it comes along, as MySQL's UTF-8 charset support is incomplete.

If using the binary schema, things _should_ work.
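The title collision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not mwdumper's or MySQL's actual code path: the titles are hypothetical, and Python's `errors="replace"` stands in for whatever substitution happens when a character has no Latin-1 equivalent. If the reported `Duplicate entry '0-?'` is read as namespace 0 plus a title reduced to `?`, it would be consistent with this kind of collapse. The last part shows that a non-BMP character needs four bytes in UTF-8, one more than MySQL's 3-byte `utf8` charset stores per character.

```python
# Hypothetical page titles; none of them is representable in Latin-1.
titles = ["中", "日", "Ω"]

# Assumption: characters without a Latin-1 equivalent are replaced by "?"
# during conversion, mimicked here with errors="replace".
converted = [t.encode("latin-1", errors="replace").decode("latin-1") for t in titles]

# Spanish accented letters exist in Latin-1 and survive the round trip.
spanish = "ñandú".encode("latin-1", errors="replace").decode("latin-1")

# A non-BMP character (code point above U+FFFF) takes 4 bytes in UTF-8.
non_bmp = "\U00010348"  # GOTHIC LETTER HWAIR
utf8_len = len(non_bmp.encode("utf-8"))

print(converted)  # all three distinct titles collapse to the same string
print(spanish)
print(utf8_len)
```

Under this reading, the accented Spanish letters themselves would not be the trigger; the first two titles outside Latin-1 that reduce to the same string would be.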

Comment 3 Jesus 2008-06-05 08:37:59 UTC

Thank you very much for your help, but it still doesn't work...

The stack trace is:

17.000 pages (87,04/sec), 17.000 revs (87,04/sec)
Exception in thread "main" java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2
	at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
	at org.mediawiki.dumper.Dumper.main(Unknown Source)
Caused by: org.xml.sax.SAXException: java.sql.SQLException: Duplicate entry '0-?' for key 2
	at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(Unknown Source)
	... 2 more
Caused by: java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2
	at org.mediawiki.importer.SqlServerStream.writeStatement(Unknown Source)
	at org.mediawiki.importer.SqlWriter.flushInsertBuffers(Unknown Source)
	at org.mediawiki.importer.SqlWriter.checkpoint(Unknown Source)
	at org.mediawiki.importer.SqlWriter15.updatePage(Unknown Source)
	at org.mediawiki.importer.SqlWriter15.writeEndPage(Unknown Source)
	at org.mediawiki.importer.MultiWriter.writeEndPage(Unknown Source)
	at org.mediawiki.importer.PageFilter.writeEndPage(Unknown Source)
	at org.mediawiki.importer.XmlDumpReader.closePage(Unknown Source)
	... 14 more


Sorry for my inexperience, but Brion, what do you mean by a "binary schema"? I have 4 parameters which could be "binary":

- MySQL connection collation (could be set from phpMyAdmin)
- Database collation (set while creating the 'wikidb' database) 
- MySQL charset (it is set to UTF-8 Unicode but I can't change it from phpMyAdmin. Should I change it? How can I change it?)
- Database character set (I can set it on the MediaWiki configuration page, with the options "Backwards-compatible UTF-8", "Experimental MySQL 4.1/5.0 UTF-8" or "Experimental MySQL 4.1/5.0 binary")

I tried many configurations of these parameters but the problem persists. Could you help me, please?

Thank you very much.

Comment 4 Max Semenik 2008-08-22 09:16:53 UTC

*** Bug 14958 has been marked as a duplicate of this bug. ***

Comment 5 Umherirrender 2010-04-04 17:36:47 UTC

Try to set the default-character-set in the my.ini or my.cnf (mysql\bin) of mysql to

default-character-set="utf8"

and restart the server.

Comment 6 Umherirrender 2010-04-04 17:55:37 UTC

(In reply to comment #5)
> Try to set the default-character-set in the my.ini or my.cnf (mysql\bin) of
> mysql to
> default-character-set="utf8"
> and restart the server.

You can also append

&characterEncoding=UTF-8

to the --output parameter
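As a minimal sketch of what the resulting argument looks like, the snippet below appends the parameter to the `--output` URL from the original report. The helper `with_jdbc_param` is hypothetical and exists only for this illustration; `<user>` and `<password>` are the placeholders used in the report, not real values.

```python
def with_jdbc_param(output_url: str, key: str, value: str) -> str:
    """Append key=value to a mysql:// URL, using '?' or '&' as appropriate."""
    separator = "&" if "?" in output_url else "?"
    return f"{output_url}{separator}{key}={value}"

# URL taken from the command line in the description.
base = "mysql://127.0.0.1/wikidb?user=<user>&password=<password>"
output_arg = "--output=" + with_jdbc_param(base, "characterEncoding", "UTF-8")

print(output_arg)
```

As in the original command, the whole argument should stay inside double quotes on the command line, since `&` is special to the shell.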

Comment 7 Adam Wight 2012-01-19 17:42:46 UTC

I've hit a similar encoding bug while importing enwiki. I was piping to SQL using this command line:

  bunzip2 -c enwiki-20120104-pages-articles.xml.bz2 | mwdumper --format=sql:1.5 > out.sql


Exception in thread "main" java.io.IOException: not a name start character: "U+26"
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   at org.mediawiki.dumper.Dumper.main(mwdumper)
Caused by: org.xml.sax.SAXParseException: not a name start character: "U+26"
   at gnu.xml.stream.SAXParser.parse(libgcj.so.10)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.10)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.10)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   ...1 more
Caused by: javax.xml.stream.XMLStreamException: not a name start character: "U+26"
   at gnu.xml.stream.XMLParser.error(libgcj.so.10)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.10)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.10)
   at gnu.xml.stream.XMLParser.readCharData(libgcj.so.10)
   at gnu.xml.stream.XMLParser.next(libgcj.so.10)
   at gnu.xml.stream.SAXParser.parse(libgcj.so.10)
   ...4 more
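For reference, U+26 is the ampersand, which in XML character data may only appear as the start of an entity or character reference. The sketch below uses Python's standard XML parser, not the GNU parser from the trace, to show that a bare ampersand is rejected while an escaped one is accepted. It does not establish whether the dump or the parser was at fault here; the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

# U+26 is the ampersand character.
ampersand = chr(0x26)

def parses(xml_text: str) -> bool:
    """Return True if Python's parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

raw_ok = parses("<title>a & b</title>")          # bare ampersand
escaped_ok = parses("<title>a &amp; b</title>")  # escaped ampersand

print(ampersand, raw_ok, escaped_ok)
```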
