Last modified: 2011-11-29 03:20:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T8172, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 6172 - data dump has CRC error
Status: RESOLVED WORKSFORME
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://download.wikipedia.com/enwiki/...
Reported: 2006-06-02 15:14 UTC by Xiaoquan Zhang
Modified: 2011-11-29 03:20 UTC


Description Xiaoquan Zhang 2006-06-02 15:14:23 UTC
In an attempt to study the incentives to contribute to Wikipedia, Feng Zhu from
Harvard Business School and I (MIT Sloan School of Management) wanted to examine
the modification history of Wikipedia entries.  We downloaded the following
data dump file:
http://download.wikipedia.com/enwiki/20060518/enwiki-20060518-pages-meta-history.xml.bz2
and found that it contains CRC errors.  We then followed the link to
download a few previous versions of the file, but they all had problems.
Here is the error message returned by bzip2 -t:

---- error message ----
bzip2 -t enwiki-20060518-pages-meta-history.xml.bz2
bzip2: enwiki-20060518-pages-meta-history.xml.bz2: data integrity (CRC) error in data

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
---- end of error message ----
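
For anyone who wants to repeat this kind of integrity test from a script, here is a minimal sketch using Python's standard bz2 module. It is not what was run above (that was plain bzip2 -t), and the function name bz2_is_intact is made up for illustration. It decompresses the whole file, throws the output away, and reports whether the decoder complained.

```python
import bz2


def bz2_is_intact(path, chunk_size=1 << 20):
    """Return True if the file decompresses cleanly, roughly like `bzip2 -t`.

    Hypothetical helper for illustration; not part of any dump tooling.
    """
    with bz2.open(path, "rb") as stream:
        try:
            # Read and discard the output; only decoding errors matter here.
            while stream.read(chunk_size):
                pass
        except (OSError, EOFError):
            # OSError: invalid or corrupt data stream (or a read error).
            # EOFError: the file ended before the end-of-stream marker.
            return False
    return True
```

On a 33 GB dump this takes as long as a full decompression, so it is no faster than bzip2 -t itself.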
Comment 1 Brion Vibber 2006-06-02 18:38:24 UTC
Please provide the following information:

1) md5 checksum of the file (md5sum 2006*.bz2)
2) your version of bzip2 (bzip2 --version)
3) your operating system and version
4) your cpu architecture
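
If md5sum is not available, the checksum in item 1 can be computed with a short script. This is only a sketch using Python's hashlib; the function names are made up for illustration, and the published checksum has to be copied by hand from the download page. The file is read in chunks so that a multi-gigabyte dump does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare the local digest with a checksum copied from the download page."""
    return md5_of_file(path) == published_md5.strip().lower()
```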
Comment 2 Xiaoquan Zhang 2006-06-03 18:36:39 UTC
>1) md5 checksum of the file (md5sum 2006*.bz2)
d521c59d94852920648565a80f5e1b90
It is different from the one on the website, but I had no problems while
downloading this file, and I tried it several times.

>2) your version of bzip2 (bzip2 --version)
     1.0.3  15-Feb-2005

>3) your operating system and version
     RedHat Enterprise Server 3

>4) your cpu architecture
     Intel Pentium III (Coppermine) 1GHz
Comment 3 Brion Vibber 2006-06-03 20:13:41 UTC
Looks like your download is corrupt then.

Check in particular that your download program handles 
large files (this is over 33 gigabytes; if your download 
is less than 4 gigabytes then you have a buggy download 
tool or a buggy HTTP proxy).

Note that the 7zip version of this file is smaller and 
faster to download, but still over 5 gigabytes so you need 
to confirm that your download was correct.
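
A quick way to run the size check described above is sketched below. The function name and the returned labels are made up for illustration, and expected_size is assumed to be the byte count shown on the download page. A file that ends up under 4 GiB when the listing says it should be larger points to a tool or proxy that cannot handle large files.

```python
import os

FOUR_GIB = 4 * 1024**3


def check_download_size(path, expected_size):
    """Classify a download by comparing its on-disk size with the listed size.

    Hypothetical helper; expected_size must be supplied by the caller.
    """
    actual = os.path.getsize(path)
    if actual == expected_size:
        return "ok"
    if actual < FOUR_GIB <= expected_size:
        # The file should be over 4 GiB but is not: suspect the download
        # tool or an HTTP proxy without large-file support.
        return "truncated-below-4GiB"
    return "size-mismatch"
```

A matching size does not prove the content is intact, so the MD5 comparison is still needed.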
Comment 4 Xiaoquan Zhang 2006-06-04 03:04:58 UTC
Thanks for the tip.  I used wget 1.10.2, and the file size is correct.  I have never had
a problem with wget, but I admit 33GB might be too much for any download program.

I wonder if you could create a special version for us with the content in the <text>
tags removed.  For our research we only need the modification history
(who modified what at what time, etc.).  Thanks, please let us know.

The reason we want to go with bzip2 (instead of 7zip) is that I wrote a Perl
parser that reads directly from bzip2 files and writes a new file without
the information in the <text> tags.  I'm not sure if that is feasible with the 7zip format.
Comment 5 Brion Vibber 2006-06-04 05:05:39 UTC
Tried reading from a pipe on stdin? (7zip also decompresses 
about 10 times faster than bzip2.)
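
A rough sketch of the pipe approach, in Python rather than the Perl parser mentioned earlier (which is not shown in this report). The function name strip_text is made up for illustration. It takes the decompressed XML as an iterable of lines and drops everything between the opening and closing <text> tags, so it does not care which decompressor produced the stream. It assumes, as in the dump format, that the page text is XML-escaped and therefore never contains a literal </text>.

```python
import re

# Matches <text ...> and the self-closing <text ... /> form.
TEXT_OPEN = re.compile(r"<text\b[^>]*?(/?)>")
TEXT_CLOSE = "</text>"


def strip_text(lines):
    """Yield the input lines with every <text> element (tags and body) removed."""
    inside = False  # True while we are within a <text> body
    for line in lines:
        kept = []
        pos = 0
        while pos < len(line):
            if inside:
                end = line.find(TEXT_CLOSE, pos)
                if end == -1:
                    pos = len(line)  # rest of the line is revision text
                else:
                    inside = False
                    pos = end + len(TEXT_CLOSE)
            else:
                match = TEXT_OPEN.search(line, pos)
                if match is None:
                    kept.append(line[pos:])
                    pos = len(line)
                else:
                    kept.append(line[pos:match.start()])
                    pos = match.end()
                    # A self-closing tag has no body to skip.
                    inside = match.group(1) != "/"
        out = "".join(kept)
        if out.strip():  # drop lines left empty after stripping
            yield out
```

To use it as a filter, a script would call sys.stdout.writelines(strip_text(sys.stdin)) and be fed from a decompressor writing to stdout, for example bzcat for the .bz2 file or 7za e -so for the 7zip one.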
Comment 6 Xiaoquan Zhang 2006-06-04 13:42:55 UTC
I'll give it a try.  Can you leave this open until I download the file and verify the
MD5 checksum?
Thanks a lot for the help!
Comment 7 Brion Vibber 2007-07-20 14:40:55 UTC
Really old, so gonna go ahead and close.
