Last modified: 2011-11-29 03:20:55 UTC
To study the incentives to contribute to Wikipedia, Feng Zhu from
Harvard Business School and I (MIT Sloan School of Management) wanted to examine
the modification history of Wikipedia entries. We downloaded the following
data dump file:
and found that it contains CRC errors. We then followed the link to
download a few previous versions of the file, but they all had problems.
Here is the error message returned by bzip2 -t:
---- error message ----
bzip2 -t enwiki-20060518-pages-meta-history.xml.bz2
bzip2: enwiki-20060518-pages-meta-history.xml.bz2: data integrity (CRC) error in
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
---- end of error message ----
Please provide the following information:
1) md5 checksum of the file (md5sum 2006*.bz2)
2) your version of bzip2 (bzip2 --version)
3) your operating system and version
4) your cpu architecture
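For anyone without md5sum at hand, item 1 can be approximated with a short script. This is only a sketch: the function names md5_of_file and matches_published are made up for illustration, and the published checksum has to be copied by hand from the download page. The file is read in chunks so that a dump of tens of gigabytes does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare the local digest with a checksum copied from the download page."""
    return md5_of_file(path) == published_md5.strip().lower()
```

A mismatch only says the local copy differs from the published file; it does not say where the damage is.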
>1) md5 checksum of the file (md5sum 2006*.bz2)
It is different from the one on the website, but I had no problem while
downloading this file, and I tried it several times.
>2) your version of bzip2 (bzip2 --version)
>3) your operating system and version
RedHat Enterprise Server 3
>4) your cpu architecture
Intel Pentium III (Coppermine) 1GHz
Looks like your download is corrupt then.
Check in particular that your download program handles
large files (this is over 33 gigabytes; if your download
is less than 4 gigabytes then you have a buggy download
tool or a buggy HTTP proxy).
Note that the 7zip version of this file is smaller and
faster to download, but still over 5 gigabytes so you need
to confirm that your download was correct.
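The size check described above can be sketched as follows. The assumptions are loud ones: expected_size is the byte count shown on the download page, and a size counter wrapping at 2**32 bytes is only one possible way a buggy tool or proxy might fail. The names diagnose_size and diagnose_file are hypothetical.

```python
import os

FOUR_GIB = 2 ** 32  # largest size a 32-bit byte counter can represent, plus one


def diagnose_size(actual_size, expected_size):
    """Classify a downloaded file by comparing its size with the expected size."""
    if actual_size == expected_size:
        return "ok"
    # A tool that tracks the size in 32 bits may stop at expected_size mod 2**32.
    if expected_size >= FOUR_GIB and actual_size == expected_size % FOUR_GIB:
        return "wrapped"
    if actual_size < expected_size:
        return "truncated"
    return "mismatch"


def diagnose_file(path, expected_size):
    """Same check, taking the actual size from a file on disk."""
    return diagnose_size(os.path.getsize(path), expected_size)
```

A result of "ok" only confirms the length; the contents still need a checksum comparison.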
Thanks for the tip. I used wget 1.10.2, and the file size is correct. I have never had
a problem with wget, but I admit 33 GB might be too much for any download program.
Could you create a special version for us with the content in the <text>
tags removed? For our research we only need the modification history
(who modified what, at what time, etc.). Thanks, and please let us know.
The reason we prefer bzip2 (instead of 7zip) is that I wrote a Perl
parser that reads directly from bzip2 files and writes a new file without
the information in the <text> tags. I'm not sure whether that is feasible with
the 7zip format.
Tried reading from a pipe on stdin? (7zip also decompresses
about 10 times faster than bzip2.)
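The pipe idea could look roughly like the sketch below. It is written in Python rather than Perl, and it is not the parser mentioned earlier in this thread. It assumes the dump is UTF-8 XML in which each revision body sits between a <text ...> tag and </text>, with any literal angle brackets inside the body escaped. The names strip_text and main are hypothetical.

```python
import io
import re
import sys

# Matches an opening <text ...> tag; group 1 is "/" for a self-closing <text />.
TEXT_OPEN = re.compile(r"<text\b[^>]*?(/?)>")
TEXT_CLOSE = "</text>"


def strip_text(lines):
    """Yield lines with everything between <text ...> and </text> removed."""
    inside = False  # True while we are inside a <text> body
    for line in lines:
        out = []
        pos = 0
        while pos < len(line):
            if inside:
                end = line.find(TEXT_CLOSE, pos)
                if end == -1:
                    break  # the body continues on the next line; drop the rest
                out.append(TEXT_CLOSE)
                pos = end + len(TEXT_CLOSE)
                inside = False
            else:
                match = TEXT_OPEN.search(line, pos)
                if match is None:
                    out.append(line[pos:])
                    break
                out.append(line[pos:match.end()])
                pos = match.end()
                inside = match.group(1) != "/"
        if out:
            yield "".join(out)


def main():
    """Filter stdin to stdout, decoding and encoding as UTF-8."""
    source = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8", errors="replace")
    sink = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    sink.writelines(strip_text(source))
    sink.flush()

# To use this as a filter, call main() at the end of the script.
```

With main() called at the end of a script saved as, say, strip_text.py, either decompressor can feed it: bzip2 -dc enwiki-20060518-pages-meta-history.xml.bz2 | python strip_text.py > history.xml, or 7za e -so followed by the archive name in place of the bzip2 command. Because the filter only sees a stream, it does not care which compression format was used.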
I'll give it a try, can you leave this open till I download and verify the
Thanks a lot for the help!
Really old, so gonna go ahead and close.