Last modified: 2014-11-17 10:35:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links may be broken except for those that display bug reports and their history. See T27312, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 25312 - MD5 or SHA1 checksum in stub dumps
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware/OS: All / All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Ariel T. Glenn
URL: http://stats.wikimedia.org/EN/EditsRe...
Keywords: analytics, patch, patch-need-review
Duplicates: 33221
Depends on: 21860
Blocks: (none)
Reported: 2010-09-25 20:41 UTC by Erik Zachte
Modified: 2014-11-17 10:35 UTC
CC: 11 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Patch adds a new sha1 tag to each revision in XML dump. (1.86 KB, patch)
2011-11-15 23:24 UTC, Diederik van Liere
Details

Description Erik Zachte 2010-09-25 20:41:23 UTC
There is a growing audience for revert stats. Nimish Gautam and Erik Zachte have both written scripts that generate revert stats by comparing revisions in the dumps via MD5 sums. Rob Lanphier expects MD5 hashes can be used for even fancier processing.

Right now the only way to harvest MD5s is to parse the full archive dumps, which takes forever.

The proposal is to store an MD5 for every revision in the stub dumps. This would allow a monthly refresh of revert stats (see URL above) and regular publication of revert data files for researchers.

e.g. 

  <page>
    <title>United States Declaration of Independence</title>
    <id>19</id>
    <revision>
      <id>1926607</id>
      <timestamp>2010-06-15T22:06:14Z</timestamp>
      <contributor>
        <username>Innotata</username>
        <id>172490</id>
      </contributor>
      <text id="1894246" />
      <md5>eff7d5dba32b4da32d9a67a519434d3f</md5>
    </revision>
  </page>
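To illustrate the intended use of per-revision hashes, here is a minimal sketch of identity-revert detection (in Python rather than the PHP of MediaWiki itself; the function names are illustrative, not from any of the scripts mentioned above). A revision whose hash matches an earlier revision of the same page restores that earlier state exactly:

```python
import hashlib

def revision_md5(text):
    """MD5 of a revision's wikitext, as it would appear in a <md5> tag."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def find_identity_reverts(revisions):
    """Given (rev_id, text) pairs in page order, flag revisions whose
    hash matches an earlier revision of the same page.

    Returns (reverting_rev_id, restored_rev_id) pairs."""
    seen = {}      # digest -> first rev_id that produced it
    reverts = []
    for rev_id, text in revisions:
        digest = revision_md5(text)
        if digest in seen:
            reverts.append((rev_id, seen[digest]))
        else:
            seen[digest] = rev_id
    return reverts
```

With hashes already present in the stub dumps, the same comparison could run without fetching any revision text at all.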
Comment 1 Ariel T. Glenn 2010-09-25 21:25:15 UTC
I would like this for another reason; I would like to see it used to compare the stub text revision in the file with the one we get from the db.  This could make sure that the stuff on disk doesn't wind up silently corrupted and then carried forward in its corrupted state forever (or until someone stumbles across the problem).

However... these sums have to be computed at some point, and that's going to slow the dumps down considerably. For my use case we would want the md5sum in the revision metadata somewhere (maybe too slow to compute at each page revision save?)

We would not be able to generate the MD5 sums for the stub dumps in any case; we wouldn't know them in advance of reading the text, unless they were added to the revision table someplace, in which case see the above.

I'm probably missing something, so feel free to get into technical details of what folks want and how it could work in practice.
Comment 2 Rob Lanphier 2010-09-26 03:53:39 UTC
The fancier processing that Erik is referring to is this:
http://www.mediawiki.org/wiki/Pending_Changes_enwiki_trial/Reversion_collapsing

...which really isn't all that specific to Pending Changes.

I'm listing bug 21860 ("Add checksum field to text table; expose it in API") as a blocker for this one, even though it may not necessarily be one.  Adding the MD5 to the stub dumps would be much simpler if it were in the db.
Comment 3 Domas Mituzas 2010-10-24 20:52:32 UTC
I don't see why #21860 is a blocker - if text is being read, calculating checksums is cheap enough.
Storing all that in the database isn't free.
Comment 4 Diederik van Liere 2011-04-19 14:56:25 UTC
We do not need a cryptographic hash (like md5) but we can use a hash such as murmurhash (http://sites.google.com/site/murmurhash/ and http://en.wikipedia.org/wiki/MurmurHash) which seems to be one of the fastest around. There is also a PHP implementation at http://sites.google.com/site/nonunnet/php/php_murmurhash
Comment 5 Roan Kattouw 2011-04-19 19:23:30 UTC
(In reply to comment #4)
> We do not need a cryptographic hash (like md5) but we can use a hash such as
> murmurhash
I did some quick benchmarks, and both the md5() and sha1() PHP functions are very fast, even for multi-megabyte inputs, so speed is not an issue.
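A rough stand-in for that benchmark (Python's hashlib rather than PHP's md5()/sha1(); the 8 MB payload size is an arbitrary choice for illustration) shows the same point, that even multi-megabyte inputs hash in milliseconds:

```python
import hashlib
import time

# Hash a multi-megabyte input once with each algorithm and report timings.
payload = b"x" * (8 * 1024 * 1024)  # 8 MB stand-in for a large revision

for name in ("md5", "sha1"):
    start = time.perf_counter()
    digest = hashlib.new(name, payload).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(digest)}-char digest in {elapsed * 1000:.1f} ms")
```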
Comment 6 Mark A. Hershberger 2011-05-03 18:56:56 UTC
Giving dump bugs to Ariel.
Comment 7 Rob Lanphier 2011-06-09 19:48:42 UTC
SHA1 *might* make more sense than MD5, if only because it may help us in a crazy future where we leverage tools associated with Git or other version control systems (for example, Mercurial uses SHA1 as well).  Not that there's anything planned, but since the choice of hash is somewhat arbitrary otherwise, SHA1 might be slightly preferable.
Comment 8 Aaron Schulz 2011-08-11 21:55:20 UTC
Fields added to tables in r94289.
Comment 9 Diederik van Liere 2011-08-11 21:57:49 UTC
Thanks Aaron! This is a very welcome feature.
Comment 10 MZMcBride 2011-08-15 19:16:09 UTC
r94289 and subsequent revisions reverted by Brion in r94541.
Comment 11 Brion Vibber 2011-08-15 19:41:49 UTC
(In reply to comment #3)
> I don't see why #21860 is a blocker - if text is being read, calculating
> checksums is cheap enough.
> Storing all that in the database isn't free.

When creating a stub dump, we haven't read the text yet -- the job of fetching and inserting the text is being deferred to a later process (textDumpPass) which pulls the text either from a previous dump or from the text table / external storage etc.

So at that point, only data within the 'page' and 'revision' tables, and anything else that can be very cheaply fetched, is available.

A rev_sha1 field that's already been pre-filled out would be usable for creating stub dumps; calculating from text after it's been read would only be usable on the final dumps (or else a second equivalent pass).

Using a separate field for this also gives greater confidence that there was not internal data corruption; if the sha1 is generated from the text that's right next to it in the same file, there's no point -- the client could calculate it as easily and reliably as the server could have, and in neither case will it indicate if the data has been corrupted on the backend.
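The corruption check described here amounts to recomputing the hash of the fetched text and comparing it to the value stored at save time. A minimal Python sketch (the base-36 encoding mirrors what MediaWiki uses for rev_sha1, though this simplified version omits MediaWiki's fixed-width padding; function names are illustrative):

```python
import hashlib

def base36_sha1(text):
    """SHA-1 of revision text, base-36 encoded (unpadded sketch of the
    rev_sha1 representation)."""
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or "0"

def verify_revision(stored_sha1, text):
    """Compare the checksum carried in the stub dump against a freshly
    computed hash of the text fetched from storage; a mismatch signals
    backend corruption."""
    return stored_sha1 == base36_sha1(text)
```

Because the stored value was computed at save time, before the text ever reached the dump pipeline, a mismatch genuinely indicates corruption rather than merely confirming a hash of already-corrupted data.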

(In reply to comment #7)
> SHA1 *might* make more sense than MD5, if only because it may help us in a
> crazy future where we leverage tools associated with Git or other version
> control systems (for example, Mercurial uses SHA1 as well).  Not that there's
> anything planned, but since the choice of hash is somewhat arbitrary otherwise,
> SHA1 might be slightly preferable.

I don't think there'd be much chance at integration here really; git's object references are based on SHA-1 checksums, but of the entire object including a header indicating type ('blob' for files) and size prepended.
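The header-prefixing git does can be shown in a few lines; because the 'blob <size>\0' header is hashed along with the content, a git object id never matches a plain SHA-1 of the same text:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """SHA-1 the way git names a blob: a 'blob <size>\\0' header is
    prepended to the content before hashing."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

content = b"hello\n"
print(git_blob_sha1(content))             # git's object id for this blob
print(hashlib.sha1(content).hexdigest())  # plain SHA-1 -- a different value
```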
Comment 12 Ariel T. Glenn 2011-08-15 20:13:37 UTC
Very correct about the data integrity piece, as I mentioned in comment 1. I use rev_len for now but that is not foolproof.  I've seen a number of revisions on other projects that have identical revision lengths (and they are not redirects either but actual content).  We've had serious data corruption in the past, and odds are we'll run into it again for one reason or another.
Comment 13 Daniel Friesen 2011-08-19 23:12:58 UTC
Bug 2939 did look like something that this blocked. Wouldn't checksum revert detection be the way to fix that bug?
Comment 14 MZMcBride 2011-08-20 01:43:06 UTC
(In reply to comment #13)
> Bug 2939 did look like something that this blocked. Wouldn't checksum revert
> detection be the way to fix that bug?

Bug 2939 is about the ability to detect reverts for the purpose of displaying the new messages notification bar. That would rely on the ability to uniquely identify revisions by putting unique identifiers in the database (bug 21860). Putting unique identifiers in the stub dumps (this bug, bug 25312) wouldn't really have anything to do with that.
Comment 15 Diederik van Liere 2011-11-09 23:13:52 UTC
Commit http://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021 adds fields to the tables.
Comment 16 Diederik van Liere 2011-11-15 23:24:45 UTC
Created attachment 9461 [details]
Patch adds a new sha1 tag to each revision in XML dump.

It will write the sha1 hash if the revision row contains this field; otherwise it will write an empty tag. I'm not sure if that is the best way to do it, and if there are any other edge cases that I didn't think of, please let me know. The patch also updates export-0.6.xsd.
Comment 17 Ariel T. Glenn 2011-11-16 08:47:48 UTC
I guess that the revision row would always contain the field, whether or not it is populated, since the patch to Export.php should go in at the same time as the schema change.

I would suggest though that we don't provide the hash when the revision has been deleted; in that case we would want to write an empty tag.
Comment 18 Diederik van Liere 2011-11-16 19:05:12 UTC
Hi Ariel, good point! I'll fix it for deleted revisions.
Comment 19 Rob Lanphier 2012-01-31 21:47:33 UTC
Diederik, is this work finished?
Comment 20 Diederik van Liere 2012-01-31 21:51:44 UTC
Yes, I think so. I updated Export.php so that the hash will be exported to the XML files once 1.19 is deployed.
Comment 21 Mark A. Hershberger 2012-02-08 22:28:31 UTC
*** Bug 33221 has been marked as a duplicate of this bug. ***
