Last modified: 2011-01-21 23:30:18 UTC
The last version is the one most accessed; previous versions are accessed less and less as they age. This was the rationale of RCS's programmers and it worked for them for years and years, and for CVS too (which in fact uses the RCS format). Storing only the last version in full, plus a "diff to latest" for each old revision, is:
a. space-savvy
b. quite efficient, since the page viewed most of the time is available with no diffing or patching at all
c. space-savvy
d. space-savvy
e. space-savvy
f. compatible with any backend: adding a revision only requires appending a record with the new data and modifying the previous last record (replacing its full text with the "to latest" diff); all the older records are left alone as they are

When I have finished my university thesis I could help implement it (either here, if the idea is liked, or in a different wiki engine). I really feel "bad" adding a single line to a 1000-line article and knowing I have "wasted" 999 lines of space ^_^
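A minimal sketch of the "last version full, diff-to-latest for the rest" scheme described above, assuming Python's difflib and an in-memory store purely for illustration; the class and function names are made up for this example, and a real engine would diff at line level and keep the records in its database rather than in memory:

import difflib

def make_delta(latest, old):
    # Delta that rebuilds `old` from `latest`: unchanged spans are stored as
    # (start, end) references into `latest`; only the chunks unique to the
    # old revision are stored verbatim.
    matcher = difflib.SequenceMatcher(None, latest, old)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            delta.append(('copy', i1, i2))       # reuse latest[i1:i2]
        else:
            delta.append(('data', old[j1:j2]))   # text present only in `old`
    return delta

def apply_delta(latest, delta):
    parts = []
    for op in delta:
        parts.append(latest[op[1]:op[2]] if op[0] == 'copy' else op[1])
    return ''.join(parts)

class ReverseDeltaStore:
    # Newest revision kept as full text; every older revision is a delta
    # against the revision that replaced it (RCS-style reverse deltas).
    def __init__(self):
        self.latest = None   # full text of the newest revision
        self.deltas = []     # deltas[i] rebuilds revision i from revision i+1

    def add_revision(self, text):
        if self.latest is not None:
            # Demote the old head: its record becomes a "to latest" delta.
            # All earlier records stay untouched.
            self.deltas.append(make_delta(text, self.latest))
        self.latest = text

    def get_revision(self, index):
        # Walk backwards from the newest revision, patching step by step.
        text = self.latest
        for i in range(len(self.deltas) - 1, index - 1, -1):
            text = apply_delta(text, self.deltas[i])
        return text

store = ReverseDeltaStore()
store.add_revision("one line\n" * 1000)
store.add_revision("one line\n" * 1000 + "a single new line\n")
assert store.get_revision(0) == "one line\n" * 1000

Adding one line to the 1000-line article here only costs the new full text plus a tiny delta record for the demoted head, which is exactly the space saving argued for above.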
As I understand it, old versions are no longer stored "whole" anyway. At the very least, they are now compressed in batches, and I have "heard" discussion of making the "text" table (now independent of both article and revision metadata) manageable by a kind of independent back-end with various storage schemes at its disposal. But certainly, this idea has been mentioned before as having potential merit, and your effort to implement it would, I'm sure, be welcomed.
Quoting Roan in http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/51583

'''
Wikimedia doesn't technically use delta compression. It concatenates a couple dozen adjacent revisions of the same page and compresses that (with gzip?), achieving very good compression ratios because there is a huge amount of duplication in, say, 20 adjacent revisions of [[Barack Obama]] (small changes to a large page, probably a few identical versions due to vandalism reverts, etc.). However, decompressing it just gets you the raw text, so nothing in this storage system helps generation of diffs. Diff generation is still done by shelling out to wikidiff2 (a custom C++ diff implementation that generates diffs with HTML markup like <ins>/<del>) and caching the result in memcached.
'''

Seems good enough. Closing bug as works for me.
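For illustration only, a rough sketch of the concatenate-and-compress approach Roan describes, assuming zlib and a JSON container just to make the example runnable; this is not MediaWiki's actual on-disk external-storage format:

import json
import random
import zlib

def pack_revisions(revisions):
    # Concatenate a batch of adjacent revision texts and compress the whole
    # blob as one unit; near-duplicate revisions compress very well together.
    return zlib.compress(json.dumps(revisions).encode('utf-8'), 9)

def unpack_revisions(packed):
    # Decompressing only gets the raw texts back; diffs between revisions
    # still have to be computed separately (Wikimedia shells out to wikidiff2).
    return json.loads(zlib.decompress(packed).decode('utf-8'))

# Toy demonstration: 20 nearly identical revisions of one large page.
random.seed(0)
words = ["alpha", "beta", "gamma", "delta", "wiki", "revision", "storage", "page"]
base = " ".join(random.choice(words) for _ in range(2000))
revisions = [base + " Edit number %d." % i for i in range(20)]

packed = pack_revisions(revisions)
one_by_one = sum(len(zlib.compress(r.encode('utf-8'), 9)) for r in revisions)
print("batched:", len(packed), "bytes vs. compressed one by one:", one_by_one, "bytes")
assert unpack_revisions(packed) == revisions

Running this prints a batched blob far smaller than the sum of the individually compressed revisions, which is the duplication effect Roan describes for adjacent revisions of the same page.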
(In reply to comment #2)
> Quoting Roan in
> http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/51583
>
> '''
> Wikimedia doesn't technically use delta compression. It concatenates a
> couple dozen adjacent revisions of the same page and compresses that
> (with gzip?), achieving very good compression ratios because there is
> a huge amount of duplication in, say, 20 adjacent revisions of
> [[Barack Obama]] (small changes to a large page, probably a few
> identical versions due to vandalism reverts, etc.). However,
> decompressing it just gets you the raw text, so nothing in this
> storage system helps generation of diffs. Diff generation is still
> done by shelling out to wikidiff2 (a custom C++ diff implementation
> that generates diffs with HTML markup like <ins>/<del>) and caching
> the result in memcached.
> '''

...and I was wrong, see the replies to that post. We actually DO use delta-based storage, almost exactly in the way you propose.