
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T3935, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 1935 - Versioned data in backend
Status: RESOLVED WORKSFORME
Product: MediaWiki
Classification: Unclassified
Component: History/Diffs (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2005-04-20 21:22 UTC by Lapo Luchini
Modified: 2011-01-21 23:30 UTC (History)
2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Lapo Luchini 2005-04-20 21:22:31 UTC
The last version is the one most accessed; the previous ones are accessed less
and less as they get older.
This was the rationale of RCS's programmers, and it worked for years and years
for them, and for CVS too (which in fact uses the RCS format).

Storing only the last version in full, plus a "diff to latest" for each old
revision, is:
a. space-savvy
b. quite efficient, as the page viewed most of the time is available with no
diffing and patching at all
c. compatible with any backend, requiring only adding a record with the new
data and modifying the last record (replacing the full text with the "to
latest" diff); all the old records are left alone as they are

When I've finished my university thesis I could help implement it (either
here, if the idea is liked, or in a different wiki engine; I really feel "bad"
adding a single line to a 1000-line article and knowing I have "wasted" 999
lines of space ^_^)
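
A minimal sketch of such a reverse-delta store, assuming nothing about MediaWiki internals (hypothetical Python; ReverseDeltaStore, make_delta and apply_delta are names invented for illustration, and MediaWiki itself is written in PHP):

import difflib

def make_delta(new_lines, old_lines):
    """Build a reverse delta: instructions that rebuild old_lines
    from new_lines, storing shared lines only as copy ranges."""
    sm = difflib.SequenceMatcher(a=new_lines, b=old_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))              # reuse lines of the newer text
        elif j1 != j2:
            delta.append(("insert", old_lines[j1:j2]))  # lines unique to the older text
    return delta

def apply_delta(new_lines, delta):
    old = []
    for op in delta:
        if op[0] == "copy":
            old.extend(new_lines[op[1]:op[2]])
        else:
            old.extend(op[1])
    return old

class ReverseDeltaStore:
    """Newest revision kept in full; every older one stored as a
    delta against its newer neighbor, as in RCS."""
    def __init__(self):
        self.head = None    # full text of the latest revision (list of lines)
        self.deltas = []    # deltas[k] rebuilds revision k from revision k+1

    def save(self, text):
        lines = text.splitlines()
        if self.head is not None:
            # Only the previous head record changes: full text -> delta.
            self.deltas.append(make_delta(lines, self.head))
        self.head = lines

    def get(self, k):
        """Revision 0 is the oldest; revision len(self.deltas) the newest."""
        lines = self.head
        for i in range(len(self.deltas) - 1, k - 1, -1):
            lines = apply_delta(lines, self.deltas[i])
        return "\n".join(lines)

store = ReverseDeltaStore()
store.save("line 1\nline 2")
store.save("line 1\nline 2\nline 3")   # adds one line to the article
assert store.get(0) == "line 1\nline 2"
assert store.get(1) == "line 1\nline 2\nline 3"

The property matching point c above: save() touches only the previous head record, never the older deltas.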
Comment 1 Rowan Collins [IMSoP] 2005-04-20 23:25:47 UTC
As I understand it, old versions are no longer stored "whole" anyway. At the
very least, they are now compressed in batches, and I have "heard" discussion of
making the "text" table (now independent of both article and revision metadata)
manageable by a kind of independent back-end with various storage schemes at its
disposal.

But certainly, this idea is one which has been mentioned before as having
potential merit, and your effort to implement it would, I'm sure, be welcomed.
Comment 2 Antoine "hashar" Musso (WMF) 2011-01-20 20:11:37 UTC
Quoting Roan in http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/51583

'''
Wikimedia doesn't technically use delta compression. It concatenates a
couple dozen adjacent revisions of the same page and compresses that
(with gzip?), achieving very good compression ratios because there is
a huge amount of duplication in, say, 20 adjacent revisions of
[[Barack Obama]] (small changes to a large page, probably a few
identical versions to due vandalism reverts, etc.). However,
decompressing it just gets you the raw text, so nothing in this
storage system helps generation of diffs. Diff generation is still
done by shelling out to wikidiff2 (a custom C++ diff implementation
that generates diffs with HTML markup like <ins>/<del>) and caching
the result in memcached.

'''

Seems good enough. Closing bug as works for me.
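
A small self-contained experiment (hypothetical Python, not Wikimedia's actual storage code) illustrates why concatenating adjacent revisions compresses so well: DEFLATE, the algorithm behind both gzip and zlib, can reference the text duplicated between revisions, so each extra near-identical copy costs very little.

import zlib

# Simulate ~20 adjacent revisions of a page: a large base text plus a
# small edit per revision (the common case on a heavily edited article).
base = ("Lorem ipsum dolor sit amet. " * 80).strip()
revisions = [base + f"\nEdit number {i} appended here." for i in range(20)]

separate = sum(len(zlib.compress(r.encode())) for r in revisions)
concatenated = len(zlib.compress("\0".join(revisions).encode()))

print(f"compressed separately:   {separate} bytes")
print(f"compressed concatenated: {concatenated} bytes")
# The concatenated blob is a small fraction of the separate total,
# because DEFLATE can reference the duplicated text across revisions
# (within its 32 KB window).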
Comment 3 Roan Kattouw 2011-01-21 23:30:18 UTC
(In reply to comment #2)
> Quoting Roan in
> http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/51583
> [...]
...and I was wrong, see the replies to that post. We actually DO use delta-based storage, almost exactly in the way you propose.
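
For completeness, a rough sketch of the kind of <ins>/<del> markup the quoted post attributes to wikidiff2 (hypothetical Python using difflib; wikidiff2 itself is C++ and its real algorithm and output format differ):

import difflib

def html_diff(old: str, new: str) -> str:
    """Render a word-level diff as HTML with <ins>/<del> markup.
    (Real code would HTML-escape the text first.)"""
    old_words = old.split()
    new_words = new.split()
    sm = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            out.append("<del>" + " ".join(old_words[i1:i2]) + "</del>")
        if tag in ("replace", "insert"):
            out.append("<ins>" + " ".join(new_words[j1:j2]) + "</ins>")
        if tag == "equal":
            out.append(" ".join(old_words[i1:i2]))
    return " ".join(out)

# Per the quoted post, the real result is cached in memcached so that
# repeated views of the same diff don't recompute it.
print(html_diff("the quick brown fox", "the quick red fox jumps"))
# -> the quick <del>brown</del> <ins>red</ins> fox <ins>jumps</ins>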
