Last modified: 2011-11-29 03:20:55 UTC

Bug 6172 - data dump has CRC error
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Nobody
Reported: 2006-06-02 15:14 UTC by Xiaoquan Zhang
Modified: 2011-11-29 03:20 UTC

Description Xiaoquan Zhang 2006-06-02 15:14:23 UTC
In an attempt to study the incentives to contribute to Wikipedia, Feng Zhu from
Harvard Business School and I (MIT Sloan School of Management) wanted to examine
the modification history of the Wikipedia entries.  We downloaded the following
data dump file:
and found that it contains CRC errors.  We then followed the link to
download a few previous versions of the file, but they all had problems.
Here is the error message returned by bzip2 -t:

---- error message ----
bzip2 -t enwiki-20060518-pages-meta-history.xml.bz2 
bzip2: enwiki-20060518-pages-meta-history.xml.bz2: data integrity (CRC) error in

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
---- end of error message ----
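
For readers who want to repeat the integrity test without the bzip2 command-line tool, here is a minimal sketch using Python's standard bz2 module. It streams through the archive in chunks, so a multi-gigabyte dump never has to fit in memory. The function name bzip2_test and the chunk size are illustrative choices, not anything from this report.

```python
import bz2


def bzip2_test(path, chunk_size=1 << 20):
    """Decompress `path` and discard the output, roughly like `bzip2 -t`.

    Returns (ok, bytes_decompressed, error_message). The name and return
    shape are hypothetical, chosen for this sketch only.
    """
    total = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError) as exc:
        # OSError: invalid data, including a CRC mismatch.
        # EOFError: the file ended before the end-of-stream marker.
        return False, total, str(exc)
    return True, total, ""
```

A failing result only says that the local copy is damaged. It does not say whether the damage happened on the server or during the download, which is why the checksum comparison requested below is still needed.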
Comment 1 Brion Vibber 2006-06-02 18:38:24 UTC
Please provide the following information:

1) md5 checksum of the file (md5sum 2006*.bz2)
2) your version of bzip2 (bzip2 --version)
3) your operating system and version
4) your cpu architecture
Comment 2 Xiaoquan Zhang 2006-06-03 18:36:39 UTC
>1) md5 checksum of the file (md5sum 2006*.bz2)
It is different from the one on the website, but I had no problems while
downloading this file, and I tried several times.

>2) your version of bzip2 (bzip2 --version)
     1.0.3  15-Feb-2005

>3) your operating system and version
     RedHat Enterprise Server 3

>4) your cpu architecture
     Intel Pentium III (Coppermine) 1GHz
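
The checksum comparison in point 1 can also be scripted. The following is a minimal sketch, assuming the published MD5 value has been copied by hand from the download site. The helper names md5_of_file and matches_published are hypothetical. The file is read in chunks so that a 33 GB dump is never loaded into memory at once.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the local digest with a published one, ignoring case and whitespace."""
    return md5_of_file(path) == published_hex.strip().lower()
```

A mismatch, as reported above, means the local file differs from the one on the server, even if the download tool reported no error.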
Comment 3 Brion Vibber 2006-06-03 20:13:41 UTC
Looks like your download is corrupt then.

Check in particular that your download program handles 
large files (this is over 33 gigabytes; if your download 
is less than 4 gigabytes then you have a buggy download 
tool or a buggy HTTP proxy).

Note that the 7zip version of this file is smaller and 
faster to download, but still over 5 gigabytes so you need 
to confirm that your download was correct.
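
The size check described above can be sketched as follows. The idea that a broken tool may wrap the size modulo 2^32 bytes is an assumption about one common failure mode of 32-bit size counters, not something stated in this thread. The expected size must come from the download site, and the function names are hypothetical.

```python
import os

FOUR_GIB = 2 ** 32  # limit of an unsigned 32-bit byte counter


def diagnose_size(actual_bytes, expected_bytes):
    """Classify a downloaded size against the size the server reports."""
    if actual_bytes == expected_bytes:
        return "size matches"
    if expected_bytes >= FOUR_GIB and actual_bytes < FOUR_GIB:
        if actual_bytes == expected_bytes % FOUR_GIB:
            # Assumed failure mode: the size counter overflowed 32 bits.
            return "size wrapped at 4 GiB: tool or proxy lacks large-file support"
        return "truncated below 4 GiB: suspect the download tool or an HTTP proxy"
    return "size differs: download incomplete or corrupt"


def diagnose_file(path, expected_bytes):
    """Same check, reading the actual size from the local file."""
    return diagnose_size(os.path.getsize(path), expected_bytes)
```

A matching size is necessary but not sufficient: only a checksum comparison confirms that the content is intact.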
Comment 4 Xiaoquan Zhang 2006-06-04 03:04:58 UTC
Thanks for the tip.  I used wget 1.10.2, and the file size is correct.  I have never had
a problem with wget, but I admit 33GB might be too much for any download program.

I wonder if you could create a special version for us with the content in <text>
tags removed.  For our research we only need the modification history
(who modified what at what time, etc.).  Thanks, please let us know.

The reason we want to go with bzip2 (instead of 7zip) is that I wrote a Perl
parser to read directly from bzip2 files and write a new file without the
information in the <text> tags.  I'm not sure if that is feasible with the 7zip format.
Comment 5 Brion Vibber 2006-06-04 05:05:39 UTC
Tried reading from a pipe on stdin? (7zip also decompresses 
about 10 times faster than bzip2.)
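
The pipe approach suggested here makes the parser independent of the archive format: the decompressor writes XML to stdout and the parser reads it from stdin. Below is a minimal sketch in Python rather than Perl (the original parser is not shown in this thread). It re-emits the XML stream and drops the character data inside <text> elements. The class and function names are hypothetical, and it assumes the dump is well-formed XML.

```python
import sys
import xml.sax
from xml.sax.saxutils import XMLGenerator


class TextStripper(XMLGenerator):
    """Re-emit SAX events, dropping character data inside <text> elements."""

    def __init__(self, out):
        super().__init__(out, encoding="utf-8", short_empty_elements=True)
        self._in_text = False

    def startElement(self, name, attrs):
        if name == "text":
            self._in_text = True
        super().startElement(name, attrs)

    def endElement(self, name):
        super().endElement(name)
        if name == "text":
            self._in_text = False

    def characters(self, content):
        # Everything outside <text> (titles, timestamps, usernames) is kept.
        if not self._in_text:
            super().characters(content)


def strip_text(source, out):
    """Parse XML from the binary stream `source`; write the stripped copy to `out`."""
    xml.sax.parse(source, TextStripper(out))


# Run only when invoked as `python strip_text.py -` ("-" meaning stdin).
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    strip_text(sys.stdin.buffer, sys.stdout)
```

Saved under a name such as strip_text.py (hypothetical), it could be fed with `bzip2 -dc enwiki-20060518-pages-meta-history.xml.bz2 | python strip_text.py - > history.xml`, or from the 7zip archive with `7za e -so <archive>.7z | python strip_text.py -`, assuming a 7-Zip command-line build that supports writing to stdout with -so.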
Comment 6 Xiaoquan Zhang 2006-06-04 13:42:55 UTC
I'll give it a try, can you leave this open till I download and verify the
Thanks a lot for the help!
Comment 7 Brion Vibber 2007-07-20 14:40:55 UTC
Really old, so gonna go ahead and close.
