Last modified: 2013-02-06 10:30:47 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T20328, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 18328 - mwdumper java.lang.IllegalArgumentException: Invalid contributor
Status: RESOLVED WORKSFORME
Product: Utilities
Classification: Unclassified
Component: mwdumper
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Brion Vibber
Keywords: patch, patch-need-review
Depends on:
Blocks:
Reported: 2009-04-03 17:09 UTC by Robert B
Modified: 2013-02-06 10:30 UTC (History)
CC List: 14 users



Attachments
test file to reproduce (7.15 KB, text/xml)
2009-05-28 01:38 UTC, Sebastian Hellmann
proposed fix (10.33 KB, patch)
2009-05-28 01:45 UTC, Sebastian Hellmann
diff for patching (1.08 KB, patch)
2009-07-07 15:06 UTC, Sebastian Hellmann

Description Robert B 2009-04-03 17:09:50 UTC
While converting the Simple English Wikipedia XML dump to an SQL file (i.e. without inserting into a database on the fly), I get a Java exception after a partially successful conversion. Here is what is displayed:

...
8,740 pages (41.603/sec), 376,000 revs (1,789.777/sec)
8,778 pages (41.713/sec), 377,000 revs (1,791.493/sec)
8,801 pages (41.713/sec), 378,000 revs (1,791.554/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
	at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
	at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
	at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(Unknown Source)
	at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
	at org.mediawiki.dumper.Dumper.main(Unknown Source)

Versions:

OS: Linux 2.6.17-1.2142 (Fedora Core 4)
Java: 1.6.0_13-b03
mwdumper: 2008-04-13
Data: Simple English Wikipedia dump of 2009-03-30

Invocation:

java -Xmx512m -Xms128m -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 -server -jar mwdumper.jar --format=sql:1.5 simplewiki-20090330-pages-meta-history.xml > simplewiki-20090330-pages-meta-history.sql &


What is going wrong, and how can this problem be fixed?
Comment 1 Robert B 2009-04-05 05:40:44 UTC
Having looked further into this issue I think I have isolated the problem. As of recent versions of MediaWiki (from around end of 2008) it is possible for the text, comment and/or contributor of a revision to be completely deleted for legal reasons (copyright infringement, libel, etc). This is mentioned in more detail here: http://www.mediawiki.org/wiki/Bitfields_for_rev_deleted

An example of a revision with a deleted contributor in the 2009-03-30 dump of Simple English Wikipedia looks like this:

    <revision>
      <id>1460119</id>
      <timestamp>2009-03-30T11:34:51Z</timestamp>
      <contributor deleted="deleted" />
      <comment>Replaced content with 'Majorly is heaps shit'</comment>
      <text xml:space="preserve">Majorly is heaps shit</text>
    </revision>

When mwdumper encounters the contributor element <contributor deleted="deleted" /> it chokes on it.

So it looks like the code needs to be fixed to be able to handle deleted contributors. Question is, what should be put in place of the contributor name?
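
To illustrate the condition described above, here is a minimal, self-contained sketch of a SAX handler that recognises the deleted="deleted" attribute instead of failing. It is not mwdumper's code: the class DeletedContributorDemo, the method contributors and the "(deleted)" placeholder are names made up for this example, and the handler only looks at the contributor, username and ip elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of mwdumper.
class DeletedContributorDemo {

    /** Returns one entry per <contributor> element found in the given XML. */
    static List<String> contributors(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // character data of the element being read
            private String name;        // username or ip of the current contributor
            private boolean deleted;    // true if deleted="deleted" was present

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text = new StringBuilder();
                if (qName.equals("contributor")) {
                    name = null;
                    deleted = "deleted".equals(atts.getValue("deleted"));
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("username") || qName.equals("ip")) {
                    name = text.toString();
                } else if (qName.equals("contributor")) {
                    if (deleted) {
                        seen.add("(deleted)"); // assumed placeholder, see lead-in
                    } else if (name != null) {
                        seen.add(name);
                    } else {
                        // The case that is reported in this bug.
                        throw new IllegalArgumentException("Invalid contributor");
                    }
                }
                text = null;
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<page><revision><contributor deleted=\"deleted\" /></revision>"
                + "<revision><contributor><username>Example</username></contributor>"
                + "</revision></page>";
        System.out.println(contributors(xml));
    }
}
```

Run as is, this prints [(deleted), Example]. The self-closing element produces a start and an end event with no username or ip in between, which is why a handler that only checks for those two children ends up with nothing to store.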
Comment 2 Sebastian Hellmann 2009-05-28 01:38:05 UTC
Created attachment 6165 [details]
test file to reproduce
Comment 3 Sebastian Hellmann 2009-05-28 01:43:58 UTC
I fixed the bug for myself.
It's probably not the nicest code, but it works.
The user IP is set to 127.0.0.1.
I hope it helps.



In XmlDumpReader.java
(I attached my version)

line 152: else if (qName == "contributor") openContributor(attributes);
and at about line 333:

void openContributor(Attributes attribs) {
	String deleted = attribs.getValue("deleted");
	if (deleted != null && deleted.equals("deleted")) {
		// Deleted contributor: substitute a placeholder IP.
		contrib = new Contributor("127.0.0.1");
	} else {
		contrib = null;
	}
}
Comment 4 Sebastian Hellmann 2009-05-28 01:45:12 UTC
Created attachment 6166 [details]
proposed fix
Comment 5 Chad H. 2009-06-10 17:57:44 UTC
Please attach all patches as a unified diff against trunk, rather than the complete file.
Comment 6 Srinivasan Ramaswamy 2009-06-20 17:12:19 UTC
Can you post the jar somewhere so that people who want a working version of mwdumper can use it? I made the code changes, but I don't have the required libraries to compile the package.

(In reply to comment #3)
> I fixed the bug for myself.
> its probably not the nicest code, but it'll work.
> UserIP is set to 127.0.0.1
> Hope it will help.
> 
> 
> 
> in XMLDumpReader.java
> (I attached my version)
> 
> line 152: else if (qName == "contributor") openContributor(attributes);
> and about line 333
> 
> void openContributor(Attributes attribs) {
>                 String deleted = attribs.getValue("deleted");
>                 if(     deleted !=null && deleted.equals("deleted")){
>                         contrib = new Contributor("127.0.0.1");
>                 }else{
>                         contrib = null;
>                 }
>         }
> 

Comment 7 Thomas Seifert 2009-06-23 15:25:06 UTC
I'd appreciate that too as I couldn't find a link to even download the source to apply this change.
Comment 8 Sebastian Hellmann 2009-07-07 14:56:01 UTC
Here is a link to a jar file that fixes the bug:

http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip
Comment 9 Sebastian Hellmann 2009-07-07 15:06:00 UTC
Created attachment 6303 [details]
diff for patching

I used:
svn diff >  invalid.contibutor.patch

I hope this is correct.

@Chad: if not, please tell me how to create a unified diff, as this is the first time I tried to create one.
Comment 10 Chad H. 2009-07-07 15:09:30 UTC
The patch format is correct, yes, but I'm not sure I like your proposed fix. Correct me if I'm wrong, but basically you're saying that if the rev has been deleted, the contributor is set to 127.0.0.1?
Comment 11 Sebastian Hellmann 2009-07-07 16:06:22 UTC
Yes. I was not sure what to put there.
127.0.0.1 was a reasonable choice, because I was sure MediaWiki can handle it.
Other options I considered were a user named 'deleted' or leaving it blank (where again I wasn't sure whether MediaWiki or the MySQL database would choke on it).

I can change it again, but I'm not sure what would be best.

if (contrib == null){
	throw new IllegalArgumentException("Invalid contributor");
}

This code says that the contributor must not be null.

So if it is set to null, I'm quite sure the program will break at another point, throwing a NullPointerException.

Basically, it is a hack, but still better than not being able to import Wikipedia dumps at all.
(And sorry for not answering for such a long time, I was on holiday for a month.)
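
As a sketch of one alternative to a fake IP address, the record could carry an explicit "deleted" flag, so the contributor is never null but is also not mistaken for a real anonymous edit. Everything below is hypothetical: ContributorSketch, its fields and the factory methods are invented for illustration and are not mwdumper's Contributor class, and the empty name for a suppressed contributor is only an assumption about what a database column could accept.

```java
// Hypothetical stand-in for a contributor record; not mwdumper's actual Contributor class.
final class ContributorSketch {
    final String name;       // username or IP address; empty when suppressed (assumption)
    final boolean anonymous; // true for edits attributed to an IP address
    final boolean deleted;   // true when the dump carried deleted="deleted"

    private ContributorSketch(String name, boolean anonymous, boolean deleted) {
        this.name = name;
        this.anonymous = anonymous;
        this.deleted = deleted;
    }

    static ContributorSketch user(String username) {
        return new ContributorSketch(username, false, false);
    }

    static ContributorSketch ip(String address) {
        return new ContributorSketch(address, true, false);
    }

    /** Placeholder for a suppressed contributor: never null, but clearly marked. */
    static ContributorSketch suppressed() {
        return new ContributorSketch("", false, true);
    }

    /** Builds a record from the attribute and child values seen for one <contributor>. */
    static ContributorSketch fromDump(String deletedAttr, String username, String address) {
        if ("deleted".equals(deletedAttr)) return suppressed();
        if (username != null) return user(username);
        if (address != null) return ip(address);
        // Nothing usable at all: keep the original failure mode.
        throw new IllegalArgumentException("Invalid contributor");
    }

    public static void main(String[] args) {
        ContributorSketch c = fromDump("deleted", null, null);
        System.out.println("deleted=" + c.deleted + ", name='" + c.name + "'");
    }
}
```

With a flag like this, the code that writes the SQL can decide separately what to emit for a suppressed contributor, rather than having that decision baked into the parser.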
Comment 12 Martin 2009-07-13 03:01:49 UTC
(In reply to comment #8)
> here is a link for the jar file that fixes the bug
> 
> http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip
> 

I downloaded the jar file specified above, and there seem to be additional changes in it. I cannot get it to output SQL for schema 1.4 or 1.5. I downloaded the file twice and ran it against the enwikipedia dump, and it completed successfully. However, when I looked at the output file it was XML, not SQL. I then downloaded the production version of mwdumper.jar, http://download.wikimedia.org/tools/mwdumper.jar, and ran it with the same command line; it died due to the bug, but it did put out SQL as requested. For clarity, the command line was java -jar mwdumper.jar --format=sql1.4 --output=file:test.sql enwikipedia-20090708.xml. Am I missing something, or is there an issue in the program with processing flags for SQL output?
Comment 13 Grozny 2009-07-25 20:49:01 UTC
I've just tested this version of mwdumper and it correctly produced an SQL file (I tried both the 1.4 and 1.5 formats).
Perhaps there's a syntax error in your invocation.
It should be --format=sql:1.5, not --format=sql1.5

But I have another problem.
After generating an SQL file from the XML dump this way:
java -jar mwdumper.jar --format=sql:1.5 enwiki-20090713-pages-articles.xml > import20090713.sql
I imported import20090713.sql into a MySQL database, but I only get 2,700,000 rows in the page and revision tables and 2,700,937 rows in the text table,
while there should be 8,801,763 pages according to http://download.wikimedia.org/enwiki/20090713/
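
One way to narrow down where rows go missing is to count the page elements in the XML dump itself and compare that number with the row count in the page table. The following is a minimal sketch using the JDK's StAX reader; PageCounter and countPages are names made up for this example and are not part of mwdumper.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; not part of mwdumper.
class PageCounter {

    /** Streams through the XML and counts <page> start tags without loading the file into memory. */
    static long countPages(InputStream in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        long pages = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    pages++;
                }
            }
        } finally {
            reader.close();
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // With a file argument, count that dump; otherwise use a tiny inline sample.
        try (InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(
                        "<mediawiki><page/><page/></mediawiki>".getBytes(StandardCharsets.UTF_8))) {
            System.out.println(countPages(in) + " pages");
        }
    }
}
```

If the count from the XML file matches the number of imported rows, the difference lies in what the dump file contains rather than in the conversion or import. For a full-size dump the JDK's XML processing limits may need to be raised; that has not been tested here.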



(In reply to comment #12)
> I downloaded the jar file specified above and there seems to be additionally
> changes to the file.  I cannot get it to output SQL for schema 1.4 or 1.5.  I
> downloaded the file twice and ran it against the enwikipedia dump and it
> completed successfully.  However when I looked at the file it was XML, not SQL.
>  I then downloaded the production version of the mwdumper.jar,
> http://download.wikimedia.org/tools/mwdumper.jar, with the same command line
> and it died in due to the bug, but it also put out SQL as requested.  For
> clarity the command line was java -jar mwdumper.jar --format=sql1.4
> --output=file:test.sql enwikipedia-20090708.xml.  Am I missing something or is
> there an issue in the program with processing flags for SQL output?  
> 

Comment 14 Robert Stojnic 2009-08-19 10:46:34 UTC
Cannot reproduce using the test file with the latest mwdumper from SVN. I also ran the conversion on the latest simplewiki history snapshot (20090817), and it went through cleanly. So, did someone fix this or what?

Closing worksforme.
Comment 15 arjun mehta 2010-07-10 14:30:29 UTC
Is there any way someone could provide a mirror of http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip?
The link seems to point to a server that is down or unavailable.
Comment 16 Sebastian Hellmann 2010-07-10 16:34:18 UTC
It will be up on Tuesday again, we are doing server maintenance...
BTW: I think it should be fixed in the original code by now.
Did you try?
Comment 17 arjun mehta 2010-07-10 16:47:16 UTC
Thanks Sebastian,

I've tried to compile it, but I don't have the gcj compiler (OS X), so I hit a roadblock there. And I'm not that savvy with working with source packages. :)

I'm sure there are many others like me... it would be great if the latest compiled JAR file was made available to the general public at all times! The latest one that's linked from the MediaWiki site is from 2007 and as we know, can't really handle the more recent xml dumps.
Comment 18 Sebastian Hellmann 2010-07-12 18:41:02 UTC
OK, I deleted my version. But I think Daniel Kinzler fixed it in the code back then, so I just compiled it again and uploaded it to the newly set-up server.

Here you go:
http://downloads.dbpedia.org/

http://downloads.dbpedia.org/mwdumpedr.jar

to compile:
-------------------
svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper mwdumper
cd mwdumper
ant jar
-------------------

By the way, the goal of the DBpedia project is to provide structured data extracted from Wikipedia in a machine-readable format (see http://dbpedia.org). I think for the most common use cases (like getting a list of all article titles in Wikipedia or all geo-coordinates), the data in DBpedia should be sufficient. We plan to include provenance data as well. I just wanted to mention an alternative to getting mwdumper and extracting the information yourself...

Hope I could help,
Sebastian
Comment 19 Sebastian Hellmann 2010-07-12 18:42:03 UTC
small correction, sorry: 

http://downloads.dbpedia.org/mwdumper.jar
Comment 20 arjun mehta 2010-07-12 20:30:19 UTC
Sebastian, thank you so much for this!
Hopefully this will be useful for others in the same position as me down the line.

I will try compiling it using your instructions, but this should help in the interim.

I've certainly looked into DBpedia, and you provide a really great alternative to data provision. Amazing work!

The Wikipedia databases can be a bit unwieldy, and I feel like MediaWiki needs a bit more capability in its Special Export (e.g. access to detailed category info through GET).

Anyway, thank you so much! Huge help.
Arjun
Comment 21 arjun mehta 2010-07-17 13:55:11 UTC
The compile worked and the latest source seems to have this issue resolved. MediaWiki should have these instructions on the mwdumper.jar page. I'll try to add them now. :)

Thanks again
Arjun
Comment 22 JulesWinnfield-hu 2012-10-19 12:31:44 UTC
I've had the same problem. A fix should be merged, together with the other fixes.
Comment 23 Andre Klapper 2012-10-19 19:54:56 UTC
Bean49: Could you please elaborate exactly what "the same problem" means, by providing exact steps and URLs in order to reproduce? Thanks!
Comment 24 JulesWinnfield-hu 2012-10-19 20:53:08 UTC
(In reply to comment #23)
Sorry! I used the jar from http://download.wikimedia.org/tools/mwdumper.jar and I should not have. Thanks for your attention.
Comment 25 Singh 2013-02-06 01:19:05 UTC
Hey guys,
   I am getting the same problem: IllegalArgument...Invalid Contributor.
Can someone suggest which mwdumper file to use and where it is available?
Also, can someone please post a link to the definitions of the Text, Revision and Page tables required to transfer the data to MySQL? I don't know the format of these tables.
Comment 26 Andre Klapper 2013-02-06 10:30:47 UTC
Singh: The first result in an internet search engine was http://www.mediawiki.org/wiki/Manual:MWDumper for me... Please ask followup questions on https://www.mediawiki.org/wiki/Project:Support_desk . Thanks!
