Last modified: 2011-03-13 18:05:37 UTC
This class (SplitMergeGzipHistoryBlob) compresses large pages much better than
the old class ConcatenatedGzipHistoryBlob, and is reasonably fast. The attached
program historyblobtest.php includes some speed tests. To use it, export a
page with its complete history and run 'php historyblobtest.php pagename.xml'.
Unlike ConcatenatedGzipHistoryBlob, SplitMergeGzipHistoryBlob does not use
serialization. To create and save an object, use:
$obj = new SplitMergeGzipHistoryBlob( $compressedBlob );
$compressedBlob = $obj->getCompressedBlob();
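The idea of storing only the compressed payload and rebuilding the object from it, instead of serializing the whole object, can be sketched like this. This is an illustrative Python sketch, not the actual PHP API; the class name, the NUL-separated layout, and the method names here are assumptions for the example.

```python
import zlib

# Illustrative sketch (NOT the real PHP class): only the compressed blob
# is persisted, and the object is reconstructed from it on demand.
class SplitMergeBlobSketch:
    def __init__(self, compressed_blob=b""):
        # keep the compressed form; nothing is inflated until needed
        self.compressed = compressed_blob

    @classmethod
    def from_texts(cls, texts):
        # join the revision texts into one flat buffer and deflate it
        # (the NUL separator is made up for this example)
        flat = b"\x00".join(t.encode() for t in texts)
        return cls(zlib.compress(flat))

    def get_compressed_blob(self):
        # this blob is what would be written to the database
        return self.compressed

    def get_texts(self):
        # rebuild the sections from the flat data, mirroring the
        # "construct from compressed blob" pattern described above
        flat = zlib.decompress(self.compressed)
        return [t.decode() for t in flat.split(b"\x00")]

blob = SplitMergeBlobSketch.from_texts(["rev one", "rev two"])
stored = blob.get_compressed_blob()      # save this
restored = SplitMergeBlobSketch(stored)  # reload without serialize()
assert restored.get_texts() == ["rev one", "rev two"]
```

The point of the round trip is that the constructor taking a compressed blob replaces unserialize(), so the storage format is just the compressed data itself.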
Three states are defined for SplitMergeGzipHistoryBlob: SM_COMPRESSED,
SM_READONLY (uncompressed, but sections and indices not yet extracted)
and SM_READWRITE (completely converted into arrays). This avoids the
overhead of extracting all sections when only a single revision is
requested. The layout of the flat uncompressed data used in state
SM_READONLY is described in the attachment below.
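The lazy three-state design described above can be sketched in Python (the real class is PHP; the section layout and method names here are invented for the example):

```python
import zlib

# Illustrative sketch of the three states; NUL-separated sections
# are an assumption for this example, not the patch's actual layout.
SM_COMPRESSED, SM_READONLY, SM_READWRITE = range(3)

class LazyBlobSketch:
    def __init__(self, compressed):
        self.state = SM_COMPRESSED
        self.compressed = compressed
        self.flat = None        # uncompressed, sections not yet split
        self.sections = None    # fully extracted arrays

    def _uncompress(self):
        if self.state == SM_COMPRESSED:
            self.flat = zlib.decompress(self.compressed)
            self.state = SM_READONLY

    def get_item(self, i):
        # single-revision read: stay in SM_READONLY and scan the
        # flat data, so we never pay for a full array conversion
        self._uncompress()
        if self.state == SM_READWRITE:
            return self.sections[i]
        return self.flat.split(b"\x00")[i]

    def add_item(self, text):
        # writing forces the full conversion to arrays (SM_READWRITE)
        self._uncompress()
        if self.state != SM_READWRITE:
            self.sections = self.flat.split(b"\x00")
            self.state = SM_READWRITE
        self.sections.append(text)

blob = LazyBlobSketch(zlib.compress(b"a\x00b"))
assert blob.get_item(1) == b"b"
assert blob.state == SM_READONLY    # a single read never built the arrays
blob.add_item(b"c")
assert blob.state == SM_READWRITE
```

The design choice is that reads stop at SM_READONLY, so fetching one revision out of a blob of hundreds never pays for splitting all of them into arrays.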
Created attachment 587 [details]
Created attachment 588 [details]
It all looks cool, but... what happens in case of a hash collision?
If two texts have the same hash, the second text is not stored in the
history blob. But this is the same behaviour as in ConcatenatedGzipHistoryBlob.
Perhaps some very intelligent person could compose a different text with
the same hash, but no one would notice: it would look like a normal reversion.
Hash collisions between random texts are very unlikely.
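The deduplication behaviour in question can be sketched as a hash-keyed store. This is a Python illustration under the assumption that texts are keyed by their MD5 digest; a colliding text would be silently dropped at the `if key not in store` check, and later reads would return the earlier text, which is exactly the "looks like a normal reversion" effect:

```python
import hashlib

# Illustrative sketch of hash-keyed deduplication (the real code is PHP;
# using MD5 here is an assumption for the example).
store = {}

def add_revision(text):
    key = hashlib.md5(text.encode()).hexdigest()
    # only store the text if the hash is new; a different text with a
    # colliding hash would be dropped right here, unnoticed
    if key not in store:
        store[key] = text
    return key

k1 = add_revision("some wikitext")
k2 = add_revision("some wikitext")   # identical revision: deduplicated
assert k1 == k2 and len(store) == 1
assert store[k1] == "some wikitext"
```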
Created attachment 591 [details]
corrected version of HistoryBlob.php
I forgot to set mMetaData in removeItem().
Created attachment 624 [details]
Created attachment 625 [details]
Created attachment 626 [details]
Changes depend on changes in HistoryBlob.php and SpecialExport.php.
Created attachment 627 [details]
I don't know whether this is a valid XML schema.
Created attachment 628 [details]
convertDump (Perl script)
This is a demo program that converts dumps generated with the
--splitrevisions and --usebackrefs options to the old format. It is very
slow, and the documents it generates differ somewhat from those generated
by dumpBackup.php (more whitespace and shuffled attributes).
Is there also an import script ready, for importing the dumps back into MySQL?
In http://mail.wikipedia.org/pipermail/wikitech-l/2005-May/029298.html Brion wrote:
> I still need to finish up an importer script using the Special:Import
I didn't find such a script in CVS yet. Once it's finished, I'll adapt the script
and SpecialImport.php to the new format (provided that my code is accepted).
No comments yet, haven't had time to follow this in detail, but
please keep it up -- more efficient compression ordering would be
*real nice* to have for 1.6 (or if it's a clean integration, perhaps
a merge to 1.5).
Definitely on the 1.6 roadmap.
Created attachment 657 [details]
MAX_FILE_SIZE = 2000000 is too small for uncompressed page histories, so I also
added the possibility to upload gzipped XML files. But now the size of the
uncompressed data may exceed memory_limit. It's probably a bad idea to
hold the complete page history in memory. Hmm...
Can you make this a unified diff?
Instead of inflating the file in-memory, it would probably be better to read it from a stream -- there should already be classes for doing that for importDump.php, I think.
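The streaming idea suggested above can be sketched in Python (illustrative only; the chunk size and helper name are made up, and the real import code would hand chunks to an XML parser rather than collect them):

```python
import gzip
import io

# Sketch: decompress an uploaded gzip stream chunk by chunk instead of
# inflating the whole history into memory at once.
def stream_uncompressed(fileobj, chunk_size=64 * 1024):
    # gzip.GzipFile wraps the stream; each read inflates only one chunk
    with gzip.GzipFile(fileobj=fileobj) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            yield chunk   # hand each chunk to the XML import parser

# simulate an uploaded dump: gzip-compress some data into a buffer
raw = b"<mediawiki>" + b"x" * 200_000 + b"</mediawiki>"
upload = io.BytesIO(gzip.compress(raw))
total = b"".join(stream_uncompressed(upload, chunk_size=4096))
assert total == raw
```

With this shape, peak memory is bounded by the chunk size rather than by the size of the uncompressed page history.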
Another old patch that's gotten recent comments. :)
This seems to combine a new history compression class with some sort of changes to the XML export format, I think mainly to allow marking identical revisions.
Assigning to Tim for the blob stuff, if there's anything in here we want to adapt to current stuff.
It's very likely that the xdiff-based solution I implemented will out-perform this one. But if you disagree, feel free to run some benchmarks to compare them.