Last modified: 2011-11-29 03:20:55 UTC
In an attempt to study the incentives to contribute to Wikipedia, Feng Zhu from Harvard Business School and I (MIT Sloan School of Management) wanted to examine the modification history of the Wikipedia entries. We downloaded the following data dump file: http://download.wikipedia.com/enwiki/20060518/enwiki-20060518-pages-meta-history.xml.bz2 and found that it contains CRC errors. We then followed the link to download a few previous versions of the file, but they all had problems. Here is the error message returned by bzip2:
---- error message ----
bzip2 -t enwiki-20060518-pages-meta-history.xml.bz2
bzip2: enwiki-20060518-pages-meta-history.xml.bz2: data integrity (CRC) error in data
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
---- end of error message ----
Please provide the following information:
1) md5 checksum of the file (md5sum 2006*.bz2)
2) your version of bzip2 (bzip2 --version)
3) your operating system and version
4) your cpu architecture
>1) md5 checksum of the file (md5sum 2006*.bz2)
d521c59d94852920648565a80f5e1b90
It is different from the one on the website, but I had no problem downloading this file, and I tried it several times.
>2) your version of bzip2 (bzip2 --version)
1.0.3, 15-Feb-2005
>3) your operating system and version
RedHat Enterprise Server 3
>4) your cpu architecture
Intel Pentium III (Coppermine), 1 GHz
Looks like your download is corrupt then. Check in particular that your download program handles large files (this is over 33 gigabytes; if your download is less than 4 gigabytes then you have a buggy download tool or a buggy HTTP proxy). Note that the 7zip version of this file is smaller and faster to download, but still over 5 gigabytes so you need to confirm that your download was correct.
Thanks for the tip. I used wget 1.10.2, and the file size is correct. I have never had a problem with wget, but I admit 33 GB might be too much for any download program. I wonder if you could create a special version for us with the content in the <text> tags removed; for our research, we only need the modification history (who modified what at what time, etc.). Thanks, please let us know. The reason we want to go with bzip2 (instead of 7zip) is that I wrote a perl parser that reads directly from bzip2 files and writes a new file without the information in the <text> tags. I'm not sure whether that is feasible with the 7zip format.
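The stripping step described above can be sketched without Perl as well. The following is a rough equivalent in Python, assuming the standard MediaWiki dump layout where each metadata tag sits on its own line and <text> elements may span multiple lines; the function names are hypothetical, not the author's:

```python
import bz2
import re

def strip_text_elements(lines):
    """Yield dump lines with the contents of <text>...</text> elements
    removed, keeping the revision metadata (contributor, timestamp, ...)."""
    in_text = False
    for line in lines:
        if not in_text:
            if '<text' in line:
                # Collapse the element (and anything after its opening
                # tag on this line) into an empty placeholder.
                yield re.sub(r'<text[^>]*>.*', '<text />', line)
                if '</text>' not in line and not line.rstrip().endswith('/>'):
                    in_text = True
            else:
                yield line
        elif '</text>' in line:
            # Drop the closing line of a multi-line text element.
            in_text = False

def strip_dump(path_in, path_out):
    # bz2.open streams the archive, so the decompressed XML is never
    # written to disk in full.
    with bz2.open(path_in, 'rt', encoding='utf-8') as src, \
         open(path_out, 'w', encoding='utf-8') as dst:
        for line in strip_text_elements(src):
            dst.write(line)
```

A line-oriented state machine like this avoids loading a full XML parser, at the cost of relying on the dump's one-tag-per-line formatting.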
Have you tried reading from a pipe on stdin? (7zip also decompresses about 10 times faster than bzip2.)
I'll give it a try. Can you leave this open until I have downloaded the file and verified the md5 checksum? Thanks a lot for the help!
Really old, so gonna go ahead and close.