This class (SplitMergeGzipHistoryBlob) compresses large pages much better than the old ConcatenatedGzipHistoryBlob class, and is reasonably fast. The attached program historyblobtest.php includes some speed tests; to use it, export a page with its complete history and run 'php historyblobtest.php pagename.xml'.

Unlike ConcatenatedGzipHistoryBlob, SplitMergeGzipHistoryBlob does not use serialization. To create and save an object, use $obj = new SplitMergeGzipHistoryBlob( $compressedBlob ) and $compressedBlob = $obj->getCompressedBlob().

Three states are defined for SplitMergeGzipHistoryBlob: SM_COMPRESSED, SM_READONLY (uncompressed, but sections and indices not yet extracted) and SM_READWRITE (completely converted into arrays). This is because extracting all sections would be too much overhead when only a single revision is requested. The layout of the flat uncompressed data used in state SM_READONLY is described at http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
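A minimal usage sketch, assuming SplitMergeGzipHistoryBlob follows the existing HistoryBlob interface (addItem()/getItem() returning and taking a hash key, as in ConcatenatedGzipHistoryBlob); only the constructor and getCompressedBlob() are taken from the description above:

<?php
require_once 'HistoryBlob.php';

// Build a new blob from scratch and add some revision texts.
$blob  = new SplitMergeGzipHistoryBlob();
$hash1 = $blob->addItem( "First revision text" );   // hash key (assumed return value)
$hash2 = $blob->addItem( "Second revision text" );

// No PHP serialization: store this string directly in the text table.
$compressed = $blob->getCompressedBlob();

// Later, reconstruct the object from the stored string. It starts in
// state SM_COMPRESSED and is only expanded as far as needed.
$blob2 = new SplitMergeGzipHistoryBlob( $compressed );
$text  = $blob2->getItem( $hash1 );

Presumably a single getItem() on a freshly loaded blob only needs to reach SM_READONLY, while modifying operations force the full SM_READWRITE conversion into arrays.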
Created attachment 587 [details] HistoryBlob.php
Created attachment 588 [details] test program
It all looks cool, but... what happens in case of a hash collision?
If two texts have the same hash, the second text is not stored in the history blob. But this is the same behaviour as in ConcatenatedGzipHistoryBlob. Maybe a very clever person could compose a different text with the same hash, but no one would notice: it would just look like a normal reversion. Hash collisions between random texts are very unlikely.
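For illustration, the deduplication described here boils down to keying stored texts by their hash (ConcatenatedGzipHistoryBlob uses md5 of the text as the key; the internals of SplitMergeGzipHistoryBlob may differ):

<?php
// Sketch of hash-keyed deduplication, not the actual class internals.
function addItemDedup( array &$items, $text ) {
    $hash = md5( $text );
    if ( !isset( $items[$hash] ) ) {
        $items[$hash] = $text;   // first text with this hash is stored
    }
    // A second, different text with the same hash would be silently
    // dropped here and read back as the first one -- i.e. it would
    // look like a plain reversion.
    return $hash;
}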
Created attachment 591 [details] corrected version of HistoryBlob.php I forgot to set mMetaData in removeItem().
Created attachment 624 [details] HistoryBlob.php.diff
Created attachment 625 [details] SpecialExport.php.diff
Created attachment 626 [details] dumpBackup.php.diff Changes depend on changes in HistoryBlob.php and SpecialExport.php.
Created attachment 627 [details] export-0.2.xsd I don't know if this is a correct XML schema.
Created attachment 628 [details] convertDump (perl script) This is a demo program that converts dumps generated with the --splitrevisions and --usebackrefs options to the old format. It is very slow and the documents generated with it differ somewhat from documents generated with dumpBackup.php (more whitespace and shuffled attributes).
Is there also an import script ready, for importing the dumps back into MySQL?
In http://mail.wikipedia.org/pipermail/wikitech-l/2005-May/029298.html Brion wrote:
> I still need to finish up an importer script using the Special:Import
> framework.
I haven't found such a script in CVS yet. Once he's finished, I'll adapt the script and SpecialImport.php to the new format (provided my code is accepted).
No comments yet, haven't had time to follow this in detail, but please keep it up -- more efficient compression ordering would be *real nice* to have for 1.6 (or if it's a clean integration, perhaps a merge to 1.5). Definitely on the 1.6 roadmap.
Created attachment 657 [details] SpecialImport.php.diff MAX_FILE_SIZE = 2000000 is too small for uncompressed page histories, so I also added the possibility to upload gzipped XML files. But now the size of the uncompressed data may exceed the memory_limit. It's probably a bad idea to hold the complete page history in memory. Hmm...
Can you make this a unified diff?
Instead of inflating the file in memory, it would probably be better to read it from a stream -- there should already be classes for doing that for importDump.php, I think.
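As a sketch of that idea (not the actual importDump.php classes), PHP's compress.zlib:// stream wrapper lets the uploaded gzip file be decompressed and parsed chunk by chunk instead of being inflated in memory; the form field name below is hypothetical:

<?php
$tmpName = $_FILES['xmlimport']['tmp_name'];     // hypothetical form field name
$handle  = fopen( 'compress.zlib://' . $tmpName, 'rb' );
if ( $handle === false ) {
    die( "Could not open uploaded file\n" );
}

$parser = xml_parser_create( 'UTF-8' );
// ... register element handlers on $parser here ...
while ( !feof( $handle ) ) {
    $chunk = fread( $handle, 65536 );            // 64 KB at a time
    xml_parse( $parser, $chunk, feof( $handle ) );
}
xml_parser_free( $parser );
fclose( $handle );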
Another old patch that's gotten recent comments. :) This seems to combine a new history compression class with some sort of changes to the XML export format, I think mainly to allow marking identical revisions. Assigning to Tim for the blob stuff, if there's anything in here we want to adapt to current stuff.
It's very likely that the xdiff-based solution I implemented will outperform this one. But if you disagree, feel free to run some benchmarks to compare them.