Last modified: 2011-03-13 18:05:37 UTC

Bug 2310 - new history compression class
Product: MediaWiki
Classification: Unclassified
Component: Database
Platform: PC Linux
Importance: Lowest enhancement (3 votes)
Target Milestone: ---
Assigned To: Tim Starling
Depends on: ---
Reported: 2005-06-03 15:22 UTC by El
Modified: 2011-03-13 18:05 UTC
CC: 2 users
See Also: ---
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Attachments:
HistoryBlob.php (18.13 KB, text/plain), 2005-06-03 15:24 UTC, El
test program (5.03 KB, text/plain), 2005-06-03 15:26 UTC, El
corrected version of HistoryBlob.php (18.21 KB, text/plain), 2005-06-04 04:19 UTC, El
HistoryBlob.php.diff (14.01 KB, patch), 2005-06-21 22:12 UTC, El
SpecialExport.php.diff (12.20 KB, patch), 2005-06-21 22:13 UTC, El
dumpBackup.php.diff (2.64 KB, patch), 2005-06-21 22:15 UTC, El
export-0.2.xsd (3.19 KB, text/plain), 2005-06-21 22:21 UTC, El
convertDump (perl script) (674 bytes, text/plain), 2005-06-21 22:50 UTC, El
SpecialImport.php.diff (3.33 KB, patch), 2005-06-30 19:06 UTC, El

Description El 2005-06-03 15:22:47 UTC
This class (SplitMergeGzipHistoryBlob) compresses large pages much better than
the old class ConcatenatedGzipHistoryBlob, and is reasonably fast. The attached
program historyblobtest.php includes some speed tests. To use it, export a
page with its complete history and call 'php historyblobtest.php pagename.xml'.

Unlike ConcatenatedGzipHistoryBlob, SplitMergeGzipHistoryBlob does not use
serialization. So to create an object from a compressed blob, use

$obj = new SplitMergeGzipHistoryBlob( $compressedBlob );

and to save it again, use

$compressedBlob = $obj->getCompressedBlob();

Three states are defined for SplitMergeGzipHistoryBlobs: SM_COMPRESSED,
SM_READONLY (uncompressed, but sections and indices not yet extracted)
and SM_READWRITE (completely converted into arrays). This is because
it would be too much overhead to extract all sections if only one revision
is requested. The layout of the flat uncompressed data as used in state
SM_READONLY is described in
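The three-state design can be sketched with a toy class. This is illustrative only: the real patch stores a flat section layout in SM_READONLY and avoids PHP serialization, whereas the toy below cheats with serialize() to stay short; the class name ToyHistoryBlob and all method names other than the constructor and getCompressedBlob() are assumptions.

```php
<?php
// Toy model of the three states described above. State names follow the
// patch; the storage format here (serialize + gzcompress) does NOT -- the
// real class keeps a flat section layout and never serializes.
const SM_COMPRESSED = 0;
const SM_READONLY   = 1; // inflated, but not yet parsed into arrays
const SM_READWRITE  = 2; // completely converted into arrays

class ToyHistoryBlob {
    private int $state;
    private string $data;      // compressed or flat form
    private array $items = []; // hash => text, valid in SM_READWRITE

    public function __construct( string $compressedBlob = '' ) {
        $this->data  = $compressedBlob;
        $this->state = $compressedBlob === '' ? SM_READWRITE : SM_COMPRESSED;
    }

    private function uncompress(): void {
        if ( $this->state === SM_COMPRESSED ) {
            $this->data  = gzuncompress( $this->data );
            $this->state = SM_READONLY;
        }
    }

    private function split(): void {
        $this->uncompress();
        if ( $this->state === SM_READONLY ) {
            $this->items = unserialize( $this->data );
            $this->state = SM_READWRITE;
        }
    }

    // The real class can answer a single-revision read from the flat
    // SM_READONLY layout; the toy splits fully for simplicity.
    public function getItem( string $hash ): ?string {
        $this->split();
        return $this->items[$hash] ?? null;
    }

    public function addItem( string $text ): string {
        $this->split();
        $hash = md5( $text );
        $this->items[$hash] = $text;
        return $hash;
    }

    public function getCompressedBlob(): string {
        $this->split();
        return gzcompress( serialize( $this->items ) );
    }
}

$w = new ToyHistoryBlob();
$h = $w->addItem( "some revision text" );
$r = new ToyHistoryBlob( $w->getCompressedBlob() ); // starts in SM_COMPRESSED
assert( $r->getItem( $h ) === "some revision text" );
```

The point of the intermediate SM_READONLY state is that a single-revision read never pays for building the full arrays.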
Comment 1 El 2005-06-03 15:24:46 UTC
Created attachment 587 [details]
HistoryBlob.php
Comment 2 El 2005-06-03 15:26:38 UTC
Created attachment 588 [details]
test program
Comment 3 Domas Mituzas 2005-06-03 15:38:06 UTC
It all looks cool, but... what happens in the case of a hash collision?
Comment 4 El 2005-06-03 15:49:01 UTC
If two texts have the same hash, the second text is not stored in the
history blob. But this is the same behaviour as in ConcatenatedGzipHistoryBlob.
Maybe some very intelligent person is able to compose a different text with
the same hash, but no one would notice: it would look like a normal reversion.
Hash collisions between random texts are very unlikely.
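The hash-keyed behaviour being discussed can be shown in a few lines. The standalone addItem() below is a hypothetical sketch of the scheme (revisions keyed by their MD5), not code from the patch:

```php
<?php
// Sketch of hash-addressed storage: a revision is keyed by its MD5, so a
// second text with the same hash is silently not stored -- the collision
// failure mode described in the comment above.
function addItem( array &$blob, string $text ): string {
    $hash = md5( $text );
    if ( !isset( $blob[$hash] ) ) {
        $blob[$hash] = $text;
    }
    return $hash;
}

$blob = [];
$h1 = addItem( $blob, "first revision" );
$h2 = addItem( $blob, "second revision" );
$h3 = addItem( $blob, "first revision" ); // same text, same key: no new entry

assert( $h1 === $h3 );
assert( count( $blob ) === 2 );
```

An identical text maps to an existing key, which is exactly why a deliberate collision would read as a normal reversion.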
Comment 5 El 2005-06-04 04:19:59 UTC
Created attachment 591 [details]
corrected version of HistoryBlob.php

I forgot to set mMetaData in removeItem().
Comment 6 El 2005-06-21 22:12:05 UTC
Created attachment 624 [details]
HistoryBlob.php.diff
Comment 7 El 2005-06-21 22:13:18 UTC
Created attachment 625 [details]
SpecialExport.php.diff
Comment 8 El 2005-06-21 22:15:43 UTC
Created attachment 626 [details]
dumpBackup.php.diff

Changes depend on changes in HistoryBlob.php and SpecialExport.php.
Comment 9 El 2005-06-21 22:21:31 UTC
Created attachment 627 [details]
export-0.2.xsd

I don't know if this is a correct XML schema.
Comment 10 El 2005-06-21 22:50:55 UTC
Created attachment 628 [details]
convertDump (perl script)

This is a demo program that converts dumps generated
with the --splitrevisions and --usebackrefs options to the
old format. It is very slow and the documents generated with it
differ somewhat from documents generated with dumpBackup.php
(more whitespace and shuffled attributes).
Comment 11 JeLuF 2005-06-23 05:58:00 UTC
Is there also an import script ready, for importing the dumps back into MySQL?
Comment 12 El 2005-06-23 09:57:48 UTC
Brion wrote:

> I still need to finish up an importer script using the Special:Import
> framework.

I didn't find such a script in CVS yet. When he's finished, I'll adapt the script
and SpecialImport.php to the new format (provided that my code is accepted).
Comment 13 Brion Vibber 2005-06-24 01:29:21 UTC
No comments yet, haven't had time to follow this in detail, but 
please keep it up -- more efficient compression ordering would be 
*real nice* to have for 1.6 (or if it's a clean integration, perhaps 
a merge to 1.5).

Definitely on the 1.6 roadmap.
Comment 14 El 2005-06-30 19:06:26 UTC
Created attachment 657 [details]
SpecialImport.php.diff

MAX_FILE_SIZE = 2000000 is too small for uncompressed page histories, so I also
added the possibility to upload gzipped XML files. But now the size of the
uncompressed data may exceed the memory_limit. It's probably a bad thing to
hold the complete page history in memory. Hmm...
Comment 15 Aaron Schulz 2008-11-01 18:53:53 UTC
Can you make this a unified diff?
Comment 16 Brion Vibber 2008-11-01 18:57:36 UTC
Instead of inflating the file in-memory, it would probably be better to read it from a stream -- there should already be classes for doing that for importDump.php, I think.
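A minimal sketch of the stream approach being suggested, using PHP's compress.zlib:// wrapper so the gzipped dump is inflated incrementally rather than held in memory (the file name and the line-by-line handling are illustrative; a real importer would feed each chunk to an XML parser):

```php
<?php
// Read a gzipped XML dump line by line through the zlib stream wrapper,
// so memory use stays bounded regardless of the uncompressed size.
$path = 'history-dump.xml.gz';

// Write a small gzipped file so the example is self-contained.
file_put_contents( 'compress.zlib://' . $path,
    "<page>\n<title>Example</title>\n</page>\n" );

$lines = 0;
$fh = fopen( 'compress.zlib://' . $path, 'r' );
while ( ( $line = fgets( $fh ) ) !== false ) {
    $lines++; // a real importer would pass $line to an XML parser here
}
fclose( $fh );
unlink( $path );

assert( $lines === 3 );
```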
Comment 17 Brion Vibber 2008-12-30 03:50:21 UTC
Another old patch that's gotten recent comments. :)

This seems to combine a new history compression class with some sort of changes to the XML export format, I think mainly to allow marking identical revisions.

Assigning to Tim for the blob stuff, if there's anything in here we want to adapt to current stuff.
Comment 18 Tim Starling 2009-02-20 14:38:22 UTC
It's very likely that the xdiff-based solution I implemented will out-perform this one. But if you disagree, feel free to run some benchmarks to compare them. 
