Last modified: 2011-09-18 07:22:28 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T29064, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27064 - itwiki-20110130-pages-articles.xml.bz2 is corrupted
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Ariel T. Glenn
Reported: 2011-01-31 12:35 UTC by Mathieu Poumeyrol
Modified: 2011-09-18 07:22 UTC (History)

Description Mathieu Poumeyrol 2011-01-31 12:35:42 UTC
$ md5sum itwiki-20110130-pages-articles.xml.bz2 
7eac57c7c521bf6f36e9a5d7ec476562  itwiki-20110130-pages-articles.xml.bz2

which is fine, according to http://dumps.wikimedia.org/itwiki/20110130/itwiki-20110130-md5sums.txt

but...

$ bunzip2 itwiki-20110130-pages-articles.xml.bz2 

bunzip2: Data integrity error when decompressing.
	Input file = itwiki-20110130-pages-articles.xml.bz2, output file = itwiki-20110130-pages-articles.xml

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bunzip2: Deleting output file itwiki-20110130-pages-articles.xml, if it exists.

$ bunzip2 -tvv itwiki-20110130-pages-articles.xml.bz2 
  itwiki-20110130-pages-articles.xml.bz2: 
    [1: huff+mtf rt+rld]
    [2: huff+mtf rt+rld]
[.... snip ....]
    [2510: huff+mtf rt+rld]
    [2511: huff+mtf rt+rld]
    [2512: huff+mtf data integrity (CRC) error in data

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
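
The transcript above shows a file whose MD5 matches the published value yet fails decompression, which suggests the checksum only confirms that the download matches the server's copy, not that the copy itself is a valid bzip2 stream. A minimal Python sketch of checking both properties before an import follows. The function names `md5_of` and `bz2_is_intact` are hypothetical, and only the standard `hashlib` and `bz2` modules are assumed.

```python
import bz2
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bz2_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file, discarding the output.

    Returns False if the stream is damaged (OSError) or ends early
    (EOFError), True if it decompresses to the end without error.
    """
    try:
        with bz2.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError):
        return False
    return True
```

Under these assumptions, a download would be accepted only when `md5_of(path)` equals the published digest and `bz2_is_intact(path)` returns `True`; the second check plays roughly the same role as `bunzip2 -t`.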
Comment 1 Mark A. Hershberger 2011-01-31 17:52:12 UTC
Got this, too. Thanks for reporting this.
Comment 2 Ariel T. Glenn 2011-02-01 02:06:40 UTC
Rerunning this job from the command line.  It should be done in a couple hours and I'll have a look.  I've saved a copy of the old bad file elsewhere on the off chance that it's useful for comparison.
Comment 3 Ariel T. Glenn 2011-02-01 05:47:35 UTC
The new file looks normal afaict.  Can you check it please?
Comment 4 Mathieu Poumeyrol 2011-02-01 07:00:41 UTC
Indeed, my import script passed the download and unzip stage. Thanks a lot, and good luck with the broken file.
Comment 5 Mathieu Poumeyrol 2011-05-12 07:14:22 UTC
New instance of the issue, this time with 

http://download.wikimedia.org/eswiki/20110511/eswiki-20110511-pages-articles.xml.bz2
Comment 6 Ariel T. Glenn 2011-05-12 10:13:01 UTC
The bzip appears to die partway through once in a while.  I'm going to have to add a check for that.  I've so far failed to duplicate it on my laptop (probably because the files I generate aren't large enough). I'll rerun that step so we have a good file in the meantime.
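
One way such a check could look, sketched in Python: compress to a temporary file, read it back to confirm it decompresses cleanly, and only then move it into place. This is not the dump scripts' actual code. `write_verified_bz2`, its parameters, and the `.inprogress` suffix are hypothetical, and Python's in-process `bz2` module stands in for whatever external bzip2 process the real job runs.

```python
import bz2
import os


def write_verified_bz2(chunks, final_path):
    """Compress `chunks` (an iterable of bytes) to `final_path`.

    The data goes to a temporary sibling file first, is re-read to make
    sure it decompresses cleanly, and is renamed into place only on success.
    """
    tmp_path = final_path + ".inprogress"  # hypothetical naming convention
    try:
        with bz2.open(tmp_path, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
        # Verification pass: a truncated or damaged stream raises here.
        with bz2.open(tmp_path, "rb") as check:
            while check.read(1 << 20):
                pass
        os.replace(tmp_path, final_path)
    finally:
        # After a successful rename the temporary file no longer exists.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With this shape, a compression step that fails partway never leaves a file under the final name, so a checksum would not be published for a broken archive.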
Comment 7 Mathieu Poumeyrol 2011-05-14 09:29:53 UTC
I confirm the new eswiki file is OK.
Comment 8 Ariel T. Glenn 2011-09-18 07:22:28 UTC
Closing.
