This class (SplitMergeGzipHistoryBlob) compresses large pages much better than the old ConcatenatedGzipHistoryBlob class, and is reasonably fast. The attached program historyblobtest.php includes some speed tests; to use it, export a page with its complete history and run 'php historyblobtest.php pagename.xml'.

Unlike ConcatenatedGzipHistoryBlob, SplitMergeGzipHistoryBlob does not use serialization. To create and save an object, use $obj = new SplitMergeGzipHistoryBlob( $compressedBlob ) and $compressedBlob = $obj->getCompressedBlob().

Three states are defined for SplitMergeGzipHistoryBlob: SM_COMPRESSED, SM_READONLY (uncompressed, but sections and indices not yet extracted) and SM_READWRITE (completely converted into arrays). This is because extracting all sections would be too much overhead when only a single revision is requested. The layout of the flat uncompressed data used in state SM_READONLY is described at http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
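A minimal usage sketch, assuming SplitMergeGzipHistoryBlob follows the existing HistoryBlob interface (addItem()/getItem() returning and taking a hash key, as in ConcatenatedGzipHistoryBlob); only the constructor and getCompressedBlob() are taken from the description above:

<?php
require_once 'HistoryBlob.php';

// Build a new blob from scratch and add some revision texts.
$blob  = new SplitMergeGzipHistoryBlob();
$hash1 = $blob->addItem( "First revision text" );   // hash key (assumed return value)
$hash2 = $blob->addItem( "Second revision text" );

// No PHP serialization: store this string directly in the text table.
$compressed = $blob->getCompressedBlob();

// Later, reconstruct the object from the stored string. It starts in
// state SM_COMPRESSED and is only expanded as far as needed.
$blob2 = new SplitMergeGzipHistoryBlob( $compressed );
$text  = $blob2->getItem( $hash1 );

Presumably a single getItem() on a freshly loaded blob only needs to reach SM_READONLY, while modifying operations force the full SM_READWRITE conversion into arrays.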
Created attachment 587 [details] HistoryBlob.php
Created attachment 588 [details] test program
It all looks cool, but... what happens in case of a hash collision?
If two texts have the same hash, the second text is not stored in the history blob. But this is the same behaviour as in ConcatenatedGzipHistoryBlob. Maybe a very clever person could compose a different text with the same hash, but no one would notice: it would just look like a normal reversion. Hash collisions between random texts are very unlikely.
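For illustration, the deduplication described here boils down to keying stored texts by their hash (ConcatenatedGzipHistoryBlob uses md5 of the text as the key; the internals of SplitMergeGzipHistoryBlob may differ):

<?php
// Sketch of hash-keyed deduplication, not the actual class internals.
function addItemDedup( array &$items, $text ) {
    $hash = md5( $text );
    if ( !isset( $items[$hash] ) ) {
        $items[$hash] = $text;   // first text with this hash is stored
    }
    // A second, different text with the same hash would be silently
    // dropped here and read back as the first one -- i.e. it would
    // look like a plain reversion.
    return $hash;
}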
Created attachment 591 [details] corrected version of HistoryBlob.php I forgot to set mMetaData in removeItem().
Created attachment 624 [details] HistoryBlob.php.diff
Created attachment 625 [details] SpecialExport.php.diff
Created attachment 626 [details] dumpBackup.php.diff Changes depend on changes in HistoryBlob.php and SpecialExport.php.
Created attachment 627 [details] export-0.2.xsd I don't know if this is a correct XML schema.
Created attachment 628 [details] convertDump (perl script) This is a demo program that converts dumps generated with the --splitrevisions and --usebackrefs options to the old format. It is very slow and the documents generated with it differ somewhat from documents generated with dumpBackup.php (more whitespace and shuffled attributes).
Is there also an import script ready, for importing the dumps back into MySQL?
In http://mail.wikipedia.org/pipermail/wikitech-l/2005-May/029298.html Brion wrote:
> I still need to finish up an importer script using the Special:Import
> framework.
I haven't found such a script in CVS yet. Once he's finished, I'll adapt the script and SpecialImport.php to the new format (provided my code is accepted).
No comments yet, haven't had time to follow this in detail, but please keep it up -- more efficient compression ordering would be *real nice* to have for 1.6 (or if it's a clean integration, perhaps a merge to 1.5). Definitely on the 1.6 roadmap.
Created attachment 657 [details] SpecialImport.php.diff MAX_FILE_SIZE = 2000000 is too small for uncompressed page histories, so I also added the possibility to upload gzipped XML files. But now the size of the uncompressed data may exceed the memory_limit. It's probably a bad idea to hold the complete page history in memory. Hmm...
Can you make this a unified diff?
Instead of inflating the file in memory, it would probably be better to read it from a stream -- there should already be classes for doing that for importDump.php, I think.
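As a sketch of that idea (not the actual importDump.php classes), PHP's compress.zlib:// stream wrapper lets the uploaded gzip file be decompressed and parsed chunk by chunk instead of being inflated in memory; the form field name below is hypothetical:

<?php
$tmpName = $_FILES['xmlimport']['tmp_name'];     // hypothetical form field name
$handle  = fopen( 'compress.zlib://' . $tmpName, 'rb' );
if ( $handle === false ) {
    die( "Could not open uploaded file\n" );
}

$parser = xml_parser_create( 'UTF-8' );
// ... register element handlers on $parser here ...
while ( !feof( $handle ) ) {
    $chunk = fread( $handle, 65536 );            // 64 KB at a time
    xml_parse( $parser, $chunk, feof( $handle ) );
}
xml_parser_free( $parser );
fclose( $handle );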
Another old patch that's gotten recent comments. :) This seems to combine a new history compression class with some sort of changes to the XML export format, I think mainly to allow marking identical revisions. Assigning to Tim for the blob stuff, if there's anything in here we want to adapt to current stuff.
It's very likely that the xdiff-based solution I implemented will outperform this one. But if you disagree, feel free to run some benchmarks to compare them.