Last modified: 2014-02-12 23:39:50 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T24137, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 22137 - GCJ library bug: mwdumper dies with "not a name start character: "U+26"" error
Status: NEW
Product: Utilities
Classification: Unclassified
Component: mwdumper
Version: unspecified
Hardware: PC Linux
Importance: Normal critical
Target Milestone: ---
Assigned To: Brion Vibber
URL: http://gcc.gnu.org/bugzilla/show_bug....
Keywords: upstream
Depends on:
Blocks:
Reported: 2010-01-18 08:36 UTC by Kelson [Emmanuel Engelhart]
Modified: 2014-02-12 23:39 UTC (History)
CC List: 4 users

Attachments
Problematic part of the XML dump (7.15 KB, application/xml)
2010-01-18 11:37 UTC, Kelson [Emmanuel Engelhart]
Much more simpler case that demonstrates error (25 bytes, application/xml)
2010-02-12 19:32 UTC, Bawolff (Brian Wolff)
Much more simpler java code that demonstrates error (483 bytes, text/x-java)
2010-02-13 11:11 UTC, Kelson [Emmanuel Engelhart]

Description Kelson [Emmanuel Engelhart] 2010-01-18 08:36:50 UTC
$ mwdumper --format=sql:1.5 itwiki-20100108-pages-articles.xml.bz2 | lzma -c > itwiki-20100108-pages-articles.sql.lzma
1000 pages (88,755/sec), 1000 revs (88,755/sec)
2000 pages (65,935/sec), 2000 revs (65,935/sec)
3000 pages (67,621/sec), 3000 revs (67,621/sec)
4000 pages (80,336/sec), 4000 revs (80,336/sec)
5000 pages (80,457/sec), 5000 revs (80,457/sec)
Exception in thread "main" java.io.IOException: not a name start character: "U+26"
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   at org.mediawiki.dumper.Dumper.main(mwdumper)
Caused by: org.xml.sax.SAXParseException: not a name start character: "U+26"
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   ...1 more
Caused by: javax.xml.stream.XMLStreamException: not a name start character: "U+26"
   at gnu.xml.stream.XMLParser.error(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readCharData(libgcj.so.81)
   at gnu.xml.stream.XMLParser.next(libgcj.so.81)
   at gnu.xml.stream.XMLParser.hasNext(libgcj.so.81)
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   ...4 more
Comment 1 Kelson [Emmanuel Engelhart] 2010-01-18 10:59:50 UTC
Here is a diff that adds line and column information to the exception message:

===================================================================
--- src/org/mediawiki/importer/XmlDumpReader.java       (revision 61197)
+++ src/org/mediawiki/importer/XmlDumpReader.java       (working copy)
@@ -36,6 +36,7 @@
 import javax.xml.parsers.ParserConfigurationException;
 import javax.xml.parsers.SAXParser;
 import javax.xml.parsers.SAXParserFactory;
+import org.xml.sax.SAXParseException;

 import org.xml.sax.Attributes;
 import org.xml.sax.SAXException;
@@ -82,15 +83,17 @@
         */
        public void readDump() throws IOException {
                try {
-                       SAXParserFactory factory = SAXParserFactory.newInstance();
-                       SAXParser parser = factory.newSAXParser();
+                   SAXParserFactory factory = SAXParserFactory.newInstance();
+                   SAXParser parser = factory.newSAXParser();

                        parser.parse(input, this);
                } catch (ParserConfigurationException e) {
                        throw (IOException)new IOException(e.getMessage()).initCause(e);
+               } catch (SAXParseException e) {
+                   throw (IOException)new IOException(e.getMessage() + " (line: " + e.getLineNumber() + " column: " + e.getColumnNumber() + ")").initCause(e);
                } catch (SAXException e) {
-                       throw (IOException)new IOException(e.getMessage()).initCause(e);
-               }
+                   throw (IOException)new IOException(e.getMessage()).initCause(e);
+               }
                writer.close();
        }
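The idea behind the patch can be tried in isolation. The following is a minimal sketch, not mwdumper code: the class name LineColumnDemo and the helper parse are hypothetical, and it uses StandardCharsets, which requires Java 7 or later. It catches SAXParseException ahead of its superclass SAXException and appends the position in the same "(line: N column: M)" form as the diff. The exact message text and column value depend on the parser implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class LineColumnDemo {
    // Hypothetical helper: returns "ok", or the parser's message with
    // the line/column position appended when a SAXParseException occurs.
    static String parse(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
            return "ok";
        } catch (SAXParseException e) {
            // Must come before the SAXException clause, its superclass.
            return e.getMessage() + " (line: " + e.getLineNumber()
                + " column: " + e.getColumnNumber() + ")";
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A bare '&' on the second line is not well-formed XML.
        System.out.println(parse("<a>\n&</a>"));
        // The same document with a proper entity reference parses cleanly.
        System.out.println(parse("<a>\n&amp;</a>"));
    }
}
```

The first call reports an error located on line 2; the second prints "ok".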
Comment 2 Kelson [Emmanuel Engelhart] 2010-01-18 11:37:36 UTC
Created attachment 6965 [details]
Problematic part of the XML dump

I have extracted the problematic part of the dump; see the attachment.

$ mwdumper --format=sql:1.5 sample.xml.bz2 | lzma -c -d > sample.sql.lzma
Exception in thread "main" java.io.IOException: not a name start character: "U+26" (line: 82 column: 1)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   at org.mediawiki.dumper.Dumper.main(mwdumper)
Caused by: org.xml.sax.SAXParseException: not a name start character: "U+26"
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   ...1 more
Caused by: javax.xml.stream.XMLStreamException: not a name start character: "U+26"
   at gnu.xml.stream.XMLParser.error(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readCharData(libgcj.so.81)
   at gnu.xml.stream.XMLParser.next(libgcj.so.81)
   at gnu.xml.stream.XMLParser.hasNext(libgcj.so.81)
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   ...4 more
Comment 3 Bawolff (Brian Wolff) 2010-02-12 19:32:37 UTC
Created attachment 7114 [details]
Much more simpler case that demonstrates error

This is a unicode issue. If you remove the
Comment 4 Bawolff (Brian Wolff) 2010-02-12 19:33:19 UTC
Bugzilla screwed up my comment:

This is a unicode issue. If you remove the 
Comment 5 Bawolff (Brian Wolff) 2010-02-12 19:34:53 UTC
Ok, apparently bugzilla suffers from the same issue as mwdumper ;)

This is a Unicode issue. If you remove the <Unicode character removed from comment, lest Bugzilla hate me> (U+1D59F, MATHEMATICAL BOLD FRAKTUR SMALL Z; the article claims it is U+1D537, which is MATHEMATICAL FRAKTUR SMALL Z, but that is not the character in the text), everything works fine. Since it's not choking on more ordinary Unicode characters, I imagine it has something to do with that character being a 4-byte character.

It also appears that this interacts with other stuff in the file, as the character doesn't cause the error by itself.

Specifically, entity references seem to be what causes it to die after encountering the Unicode character. I think it interprets that & character as being outside the tag name (hence starting a new tag, but & (aka U+0026) cannot start a new tag). Newline characters may also have something to do with it, as removing the newline between the Unicode character and the & changes the error message.

Changing summary to more adequately reflect what I think the problem is.

Attaching simpler test case.

Note also, that if you replace the unicode character with its entity reference (&#x1D59F;), everything works fine.
Comment 6 Platonides 2010-02-12 22:58:48 UTC
Java internally uses UTF-16

"The native coded character set of the Java programming language is that of the first seventeen planes of the Unicode version 3.0 character set; that is, it consists in the basic multilingual plane (BMP) of Unicode version 1 plus the next sixteen planes of Unicode version 3. This is because the language's internal representation of characters uses the UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode the other planes. Hence a charset in the Java platform defines a mapping between sequences of sixteen-bit values in UTF-16 and sequences of bytes."
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html

The file contains U+01D59F in UTF-8, thus F0 9D 96 9F. In binary 11110000 10011101 10010110 10011111
I don't see why it is reading a U+26 (100110).
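The byte arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch: the class SupplementaryCharDemo and its helpers are made up for this purpose, and StandardCharsets requires Java 7 or later. It prints the UTF-8 bytes of U+1D59F, the UTF-16 surrogate pair Java uses for it internally, and the single byte of the ampersand for comparison.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryCharDemo {
    // UTF-8 bytes of one code point as space-separated upper-case hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
            .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // UTF-16 code units (what a Java String stores) for one code point.
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x1D59F));   // prints F0 9D 96 9F
        System.out.println(utf16Hex(0x1D59F));  // prints D835 DD9F
        System.out.println(utf8Hex('&'));       // prints 26
    }
}
```

None of the four UTF-8 bytes, and neither surrogate, equals 0x26, which is consistent with the U+26 coming from somewhere other than the character itself.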


PS: Maybe bugzilla is using mysql as utf-8 instead of binary? mysql unicode currently only supports the BMP.
Comment 7 Bawolff (Brian Wolff) 2010-02-12 23:41:45 UTC
>Java internally uses UTF-16
Yes it does, but I think the file is interpreted as UTF-8; otherwise it wouldn't be able to make sense of it at all, as UTF-8 and UTF-16 look fairly different for your average English text (I'm under the impression that UTF-16 is not compatible with ASCII, so nothing would work at all if it were using UTF-16).


>I don't see why it is reading a U+26 (100110).

The U+26 (&) comes from the entity references that follow the problematic Unicode character. It is not considered a valid (tag) start character in XML. The question is why Java, after failing to interpret the fancy Unicode character, would think that the document was starting a new tag. If you interpret F0 9D 96 9F as UTF-16, you get:
       U+F09D:   No name (Private Use Area)
    隟   U+969F:   Han ideograph   (CJK Unified Ideographs)
Which theoretically shouldn't cause any problems. (of course the rest of the file wouldn't make sense, and no guarantees that that is where the word boundaries would fall).
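The misreading described above is easy to reproduce. The sketch below is illustrative only: the class name MisreadAsUtf16 is hypothetical, and big-endian byte order is assumed because the comment does not specify one. It decodes the four UTF-8 bytes as UTF-16 and lists the resulting code units.

```java
import java.nio.charset.StandardCharsets;

public class MisreadAsUtf16 {
    // Decode raw bytes as big-endian UTF-16 and list each code unit.
    static String decodeAsUtf16BE(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+1D59F, deliberately read with the wrong charset.
        byte[] raw = {(byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F};
        System.out.println(decodeAsUtf16BE(raw)); // prints U+F09D U+969F
    }
}
```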

I'm thinking this is a bug in the underlying Java libraries, as opposed to mwdumper.
Comment 8 Platonides 2010-02-12 23:48:54 UTC
(In reply to comment #7)
> >Java internally uses UTF-16
> Yes it does, but I think the file is interpreted as UTF-8, otherwise it
> wouldn't be able to make sense of it at all, as utf-8 and utf-16 look fairly
> different for your average english text (I'm under the impression that utf-16
> is not compatible with ASCII thus nothing would work at all if it was using
> utf-16). 

Right. But it could be overflowing the 16-bit or some other failure.


> >I don't see why it is reading a U+26 (100110).
> 
> The entity references that come after the problematic unicode character is
> where the U+26 (&) comes from.
Interesting. Saving from firefox produced a literal " in the output.

> I'm thinking this is a bug with the underlying java libraries, as opposed to
> mwdumper
I also think so.
Comment 9 Kelson [Emmanuel Engelhart] 2010-02-13 11:11:27 UTC
Created attachment 7115 [details]
Much more simpler java code that demonstrates error

Compile with:
gcj -o test --main=Test Test.java

run with the demo XML code as test.xml
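The attachment itself is not reproduced on this page. As a rough stand-in, here is a hedged sketch of a comparable test: it is not the attached Test.java, the class name SupplementaryRepro and the sample document are assumptions modelled on the description in comment 5 (a 4-byte character, a newline, then an entity reference), and StandardCharsets requires Java 7 or later. It builds the document in memory instead of reading test.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class SupplementaryRepro {
    // Hypothetical sample: U+1D59F, a newline, then an entity reference.
    static byte[] sample() {
        String doc = "<a>" + new String(Character.toChars(0x1D59F)) + "\n&amp;</a>";
        return doc.getBytes(StandardCharsets.UTF_8);
    }

    // Parse with the platform's default SAX parser and return the
    // character data that was reported; any failure is rethrown unchecked.
    static String parseText(byte[] xml) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml), new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String text = parseText(sample());
        System.out.println("parsed OK, code points: "
            + text.codePointCount(0, text.length()));
    }
}
```

On a Sun JDK or OpenJDK runtime this is expected to parse without error. Whether it triggers the failure under gcj has not been verified here.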
Comment 10 Platonides 2010-02-13 16:30:32 UTC
Sun jdk / OpenJdk is not affected.
Comment 11 Kelson [Emmanuel Engelhart] 2010-02-13 16:34:17 UTC
Seems to be a bug in gcj or libgcj. See my email to the java gcc ML:
http://gcc.gnu.org/ml/java/2010-02/msg00000.html
Comment 12 Kelson [Emmanuel Engelhart] 2010-02-15 10:51:28 UTC
In the meantime, Platonides (or anyone with SVN write access), could you please apply the patch from my comment #1 (https://bugzilla.wikimedia.org/show_bug.cgi?id=22137#c1)?

Without it, it is impossible to know at which line a SAX parsing error occurs.
Comment 13 Brion Vibber 2010-02-16 17:45:25 UTC
No problem with test case and sample code w/ Apple Java 1.6 on Mac OS X 10.6.2:

java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)

Since it is reported above as working on OpenJDK and appears to be a GCJ-specific problem, I have marked this as upstream and noted the GCJ relation in the summary. Please add the upstream bug reference once it gets handled a little more upstream.
Comment 14 Kelson [Emmanuel Engelhart] 2010-02-22 10:41:49 UTC
GCC Upstream bug:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43138
