Last modified: 2014-10-27 19:04:10 UTC
In maintenance/tables.sql it says:

CREATE TABLE /*_*/revision (
  ...
  -- Key to text.old_id, where the actual bulk text is stored.
  -- It's possible for multiple revisions to use the same text,
  -- for instance revisions where only metadata is altered
  -- or a rollback to a previous version.
  rev_text_id int unsigned NOT NULL,

Currently only the first case ("revisions where only metadata is altered") is taken advantage of. Regrettably, the second ("a rollback to a previous version") seems to have been forgotten. That is, when an undo is done on an article, a new row is created in the text table, even though this new row is very likely identical to an earlier row for the same article. All we need to do is check whether that is the case; if so, the id of the existing row can be stored in the revision table instead, and we needn't create the new row at all. This would eliminate the principal source of duplicated rows in the text table!

I'm not saying you won't still need fundraising for more disks, but it is a shame to see all that text duplicated when an identical row sitting right under our nose could be reused. All we need to do is check whether this row equals a previous row.

In fact, thinking about sysop rollbacks, and indeed any edit at all, there is a significant chance that the text of an edit is an exact match for that of, say, one of the last 10 revisions. So the internals that manipulate the tables needn't be specially told "this is an undo, check whether there is an identical text row we could reuse". Instead, whenever any edit is saved, just look back through up to 10 of the text rows of the page's old revisions for identical text, and point to that row if found, instead of making a new one. We needn't search all over the text table for identical rows; we merely need to look down our short list of very likely suspects.
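A sketch of that save-time check, in Python rather than MediaWiki's actual PHP (the function name and tuple layout are hypothetical stand-ins for rows joined from the revision and text tables):

```python
def find_reusable_text_id(new_text, recent_revisions, max_lookback=10):
    """Scan up to max_lookback recent revisions for text identical to
    new_text; return the text-row id to reuse, or None if no match.

    recent_revisions is a list of (rev_text_id, rev_len, old_text)
    tuples, newest first -- a stand-in for a small join of the
    revision and text tables for one page.
    """
    new_len = len(new_text)
    for rev_text_id, rev_len, old_text in recent_revisions[:max_lookback]:
        # Cheap pre-check: the stored length must match before we
        # bother with a full string compare.
        if rev_len != new_len:
            continue
        if old_text == new_text:
            return rev_text_id
    return None
```

On a match the save path would store the returned id in rev_text_id and skip the INSERT into the text table entirely.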
Yes, we'll never catch the savings of detecting that blanked page A = blanked page B, as they are not revisions of the same page, but on the other hand we needn't search further than our short list of likely suspects.

I tried to make a patch to implement all this, but I only got as far as Article.php and then got lost. If you want, you can assign this bug to me and after a few weeks I will figure it out (if I don't end up breaking something else).

By the way, here are some views of the duplicated text one can try:

$ echo "SELECT old_text FROM text;" |
    mysql --default-character-set=binary MyDB -N |
    perl -nwle '$h{$_}++;END{for(keys %h){print $h{$_}}}' |
    sort -nr | uniq -c

$ echo "SELECT old_text FROM text;" |
    mysql --default-character-set=binary MyDB -N |
    sort | uniq -c | sort -nr |
    perl -C -anwle 'exit if $F[0]==1;/.{0,77}/;print $&;'
The above will stop new duplication, and it would be a shame not to implement it. But what about all the years and years of duplication already existing in one's text table? Should there be a program in maintenance/ to squeeze it out? Should it also be run by update.php? Or just once in a wiki's lifetime? Or just by interested parties who feel the need?

That program would squeeze out duplicates by: {for each page {go down its list of revisions, making duplicate pointers point to their first occurrence}}, then run purgeOldText.php. One needn't go to "SHA1 mapping to unbloat the text table" ( http://lists.wikimedia.org/pipermail/wikitech-l/2009-March/042373.html ) extremes.

However, perhaps we needn't restrict our thinking to a per-article paradigm, but instead just consider the whole revision->text table mapping. Maybe that would be a simpler and smarter way to do this. We would thus only involve two tables... (wait, we must consider all tables that have any mapping to the text table! Also, all this must probably be done with the wiki locked, though it would likely only take a few seconds for a small wiki.) E.g., running our shell/perl scripts above, we find 279 separate pointers to blank (0 byte, vandalism) article revisions. These could all be made to point to a single text row, even though they are not revisions of the same article.
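The per-page squeeze step above can be sketched as follows; this is a toy model operating on in-memory rows, not the attached script, and all names are hypothetical:

```python
def squeeze(revisions):
    """Make later revisions whose text duplicates an earlier revision
    of the same page point at the earlier text row.

    revisions is a list of dicts with keys page_id, rev_text_id and
    text, ordered oldest first.  Returns (revisions, orphans) where
    orphans is the set of now-unreferenced text-row ids -- the rows a
    purgeOldText.php-style pass would then delete.
    """
    first_seen = {}   # (page_id, text) -> canonical rev_text_id
    orphans = set()
    for rev in revisions:
        key = (rev["page_id"], rev["text"])
        if key in first_seen:
            # Duplicate: repoint at the first occurrence's text row.
            orphans.add(rev["rev_text_id"])
            rev["rev_text_id"] = first_seen[key]
        else:
            first_seen[key] = rev["rev_text_id"]
    return revisions, orphans
```

Keying on (page_id, text) keeps this strictly per-page; dropping page_id from the key would give the whole-table variant discussed above, at the cost of having to audit every table that maps into text.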
Created attachment 5997 [details]
script to squeeze text table and reset pointers to it

The neat thing is that for many of our checks we needn't go beyond the revision table, as the size of the text is stored there. So if the sizes do not match, there is no need to haul the text out of the text table for a string compare.

OK, I have achieved squeezing _current_ duplicates out of the text table and resetting their pointers! See the attached squeeze_example.txt. One might squeeze out the current duplicates, then sit back and see where any new ones are arriving from, and then address each case...
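That size pre-filter can be illustrated like so (a hypothetical sketch, not code from the attached script): using only the stored lengths from the revision table, we narrow down which text rows could possibly be duplicates, and only those ever need fetching for a byte-for-byte compare.

```python
def candidate_groups(rev_rows):
    """Group text-row ids by stored text length.

    rev_rows is a list of (rev_text_id, rev_len) tuples, as could be
    read from the revision table alone.  Only groups with more than
    one member can contain duplicates, so only their text rows need
    to be fetched and string-compared.
    """
    by_len = {}
    for text_id, length in rev_rows:
        by_len.setdefault(length, []).append(text_id)
    return [ids for ids in by_len.values() if len(ids) > 1]
```

For a table dominated by distinct lengths this discards most rows before any text is read; the 279 zero-byte blankings mentioned above would all land in one group.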
Thread reference: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/42847
(In reply to Dan Jacobson from comment #2) > Created attachment 5997 [details] > script to squeeze text table and reset pointers to it Dan: If you're still interested in getting this into MediaWiki, would you be interested in turning this into a patch in Gerrit? See https://www.mediawiki.org/wiki/Gerrit/Tutorial and https://www.mediawiki.org/wiki/Developer_access
Gee, that was five years ago. Looking back, boy, I must have been a whiz kid. But now I am not that smart, so I'll just leave whatever it is up to you guys.