Last modified: 2011-10-12 01:16:47 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T5473, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 3473 - 20050909_pages_current.xml.gz causes XML::Parser::Expat to abort processing
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: 1.6.x
Hardware: All All
Importance: Normal blocker
Target Milestone: ---
Assigned To: Brion Vibber
URL: http://mail.wikipedia.org/pipermail/w...
Keywords:
Duplicates: 3478
Depends on:
Blocks:
Reported: 2005-09-15 16:57 UTC by Tyler Riddle
Modified: 2011-10-12 01:16 UTC

Description Tyler Riddle 2005-09-15 16:57:30 UTC
Hello,

The most recent dump file (20050909_pages_current.xml.gz) causes the Perl Expat module to abort early with the following message: reference to invalid character number at line 273541, column 5, byte 24690464.

The data starting at that byte is:

&#xD801;昏]]</text>
    </revision>
  </page>
  <page>
    <title>Barbara Olson</title>
    <id>4195</id>
    <revision>
      <id>20104101</id>
      <timestamp>2005-08-02T08:47:35Z</timestamp>
      <contributor>
        <username>TMC1982</username>

Unfortunately, this error causes Expat to throw an exception, and no more processing is possible. It should be possible to analyze the dump file, remove erroneous entries, and restart processing, but I can't help but feel the dump process should not let this happen.

This can be verified with Parse::MediaWikiDump available via CPAN. 
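
The failure can also be reproduced in isolation, without the full dump. The following is a minimal sketch that uses Python's binding to the Expat library rather than the Perl XML::Parser::Expat module named above; the function name first_parse_error and the one-element sample document are illustrative assumptions. U+D801 lies in the surrogate range (U+D800 to U+DFFF), which the XML 1.0 Char production excludes, so a character reference to it is not well-formed.

```python
import xml.parsers.expat as expat


def first_parse_error(data: bytes):
    """Parse `data` with Expat and return (message, line, column) for the
    first error, or None if the document parses cleanly."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError as err:
        return expat.ErrorString(err.code), err.lineno, err.offset
    return None


# A reference to a lone surrogate, as found in the dump.
bad = b"<text>&#xD801;</text>"
# A reference to an ordinary character, for comparison.
good = b"<text>&#x4E2D;</text>"

print(first_parse_error(bad))
print(first_parse_error(good))
```

As in the report, Expat stops at the first such reference, so nothing after it in the file is delivered to the caller.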

Tyler Riddle
Comment 1 Antoine "hashar" Musso (WMF) 2005-09-15 17:02:34 UTC
See the URL: a thread opened by Jakob Voss about the problem, which
also gives a way to fix it manually.

Assigning the bug to Brion, as he is currently writing mwdumper.
Comment 2 Antoine "hashar" Musso (WMF) 2005-09-15 17:04:29 UTC
from wikitech-l:

Brion Vibber wrote:
>>>> Now filed as http://bugzilla.ximian.com/show_bug.cgi?id=76095
>>>> Will see about fixing...
>> Have submitted a patch. The next dump should be correct.

Patch accepted into Mono subversion repository. These guys are fast. :)

-- brion vibber (brion @ pobox.com)


Comment 3 Brion Vibber 2005-09-15 21:34:36 UTC
As above, already fixed. Next dump will be correct. (You can filter this dump if you need to.)
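
One possible way to filter the dump is sketched below. It is an assumption about what such a filter could look like, not the fix described in the linked thread: it drops every numeric character reference whose code point falls outside the XML 1.0 Char production and leaves everything else untouched. The names is_xml_char, strip_invalid_char_refs and filter_stream are hypothetical.

```python
import io
import re

# Numeric character references: decimal (&#55297;) or hexadecimal (&#xD801;).
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(cp: int) -> bool:
    """True if code point `cp` is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(line: str, replacement: str = "") -> str:
    """Replace references to invalid code points; keep valid ones as-is."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if is_xml_char(cp) else replacement
    return CHAR_REF.sub(fix, line)


def filter_stream(src, dst) -> None:
    """Copy text lines from `src` to `dst`, filtering each line."""
    for line in src:
        dst.write(strip_invalid_char_refs(line))


# Small in-memory demonstration.
out = io.StringIO()
filter_stream(io.StringIO("&#xD801;]]</text>\n<title>A&#x26;B</title>\n"), out)
print(out.getvalue(), end="")
```

For the real file, the streams could come from gzip.open(path, "rt", encoding="utf-8", errors="replace") and gzip.open(path, "wt", encoding="utf-8"). Whether removing the references is enough to make this particular dump parse has not been checked here.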
Comment 4 Brion Vibber 2005-09-17 00:07:42 UTC
*** Bug 3478 has been marked as a duplicate of this bug. ***
Comment 5 peter green 2005-09-22 11:56:11 UTC
What are entities doing in the dump in the first place?

Surely literal Unicode in the wikitext should become literal Unicode in the dump,
and entities in the wikitext should be escaped when put into the dump so they
won't be changed to literal Unicode by the XML parser.
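
The behaviour described in this comment can be illustrated with a short sketch. It assumes nothing about how the actual dump process is written; wikitext_to_xml_text and round_trip are hypothetical helpers built on Python's standard library. Escaping the ampersand turns an entity typed in the wikitext into plain character data, while literal Unicode passes through unchanged, so a parser returns exactly what the editor typed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wikitext_to_xml_text(wikitext: str) -> str:
    """Wrap wikitext in a <text> element, escaping &, < and >."""
    return "<text>" + escape(wikitext) + "</text>"


def round_trip(wikitext: str) -> str:
    """Serialize the wikitext, parse it back, and return the element text."""
    return ET.fromstring(wikitext_to_xml_text(wikitext)).text


sample = "typed entity: &#xD801; and literal Unicode: 昏"
print(wikitext_to_xml_text(sample))
print(round_trip(sample) == sample)
```

Because the typed entity is written as &amp;#xD801; it never reaches the parser as a character reference, so it cannot trigger the error reported in this bug.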
Comment 6 Brion Vibber 2005-09-22 22:13:33 UTC
See the link to the Mono bug above.
