Last modified: 2014-10-27 19:04:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T20333, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 18333 - eliminate principal source of duplicated text table rows: unchecked undos
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Database (Other open bugs)
Version: 1.15.x
Hardware: All
OS: All
Importance: Lowest enhancement (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2009-04-03 23:56 UTC by Dan Jacobson
Modified: 2014-10-27 19:04 UTC (History)
0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
script to squeeze text table and reset pointers to it (2.83 KB, text/plain)
2009-04-06 22:42 UTC, Dan Jacobson
Details

Description Dan Jacobson 2009-04-03 23:56:48 UTC
In maintenance/tables.sql it says

CREATE TABLE /*_*/revision (...
  -- Key to text.old_id, where the actual bulk text is stored.
  -- It's possible for multiple revisions to use the same text,
  -- for instance revisions where only metadata is altered
  -- or a rollback to a previous version.
  rev_text_id int unsigned NOT NULL,

OK, it seems currently only the

  -- for instance revisions where only metadata is altered

is taken advantage of. However, regrettably,

  -- or a rollback to a previous version.

was forgotten about.

I.e., currently when an undo on an article is done, a new row is created
in the text table in the database.

However it is very likely that this new row will be identical to a
previous row there for the same article.

All we need to do is double-check whether this is true; if so, the id of
the previous row can be placed in the revision table instead of the id
of a new row, and we needn't create the new row at all.

This would eliminate the principal source of duplicated rows in the text
table! I'm not saying you won't still need fundraising for more disks,
but it is a shame to see all that text duplicated when an identical row
sitting right under our nose could be reused.

All we need to do is check if this row = previous row.

Actually, when you think about sysop rollbacks, and indeed any edit at
all to an article, there is a significant chance that the text of this
edit (the text that would end up in a row) exactly matches that of one
of, say, the last 10 edits.

That way the internals that manipulate the tables needn't specially be
told "this is an undo, double-check whether there is an identical text
row we could reuse".

Instead, when any edit is done, just check back through, oh, up to 10 of
the text rows of its old revisions for identical text, and point to that
row if found, instead of making a new row.

We needn't search all over the text table for identical rows, we merely
need to look down our short list of very likely suspects.

Yes, we'll never catch the savings of detecting that blanked page A =
blanked page B, as they are not revisions of the same page, but on the
other hand we needn't search further than our short list of likely
suspects.
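
The check described above can be sketched as follows. This is a minimal
illustration over in-memory stand-ins for the revision and text tables
(the function name, argument shapes, and the simulated auto-increment
are all mine, not MediaWiki's); the real change would live in the save
path, e.g. Article.php.

```python
def text_id_for_edit(recent_revs, text_table, new_text, lookback=10):
    """recent_revs: newest-first list of (rev_len, text_id) pairs.
    text_table: dict of text_id -> stored text.
    Return a reused text id when one of the last `lookback` revisions
    already stores identical text; otherwise insert a new row."""
    for rev_len, text_id in recent_revs[:lookback]:
        # Cheap pre-filter: lengths must match before comparing bytes.
        if rev_len == len(new_text) and text_table[text_id] == new_text:
            return text_id                      # reuse the existing row
    new_id = max(text_table, default=0) + 1     # simulate auto-increment
    text_table[new_id] = new_text
    return new_id
```

Note the length comparison comes first, so most non-matches are rejected
without ever reading the stored text.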

I tried to make a patch to implement all this, but I only got as far as
Article.php and then got lost. If you want, you can assign this bug to
me and after a few weeks I will figure it out (if I don't end up
breaking something else).

By the way, here are some ways one can view the duplicated text:

# Histogram: how many texts occur once, twice, three times, ...
$ echo "SELECT old_text FROM text;"|
  mysql --default-character-set=binary MyDB -N|
  perl -nwle '$h{$_}++;END{for(keys %h){print $h{$_}}}'|
  sort -nr|uniq -c

# First 77 characters of each duplicated text, most duplicated first:
$ echo "SELECT old_text FROM text;"|
  mysql --default-character-set=binary MyDB -N|
  sort|uniq -c|sort -nr|
  perl -C -anwle 'exit if $F[0]==1;/.{0,77}/;print $&;'
Comment 1 Dan Jacobson 2009-04-04 00:35:37 UTC
The above would stop new duplication, and it would be a shame not to implement it.

But what about all the years and years of current duplication already
existing in one's text table?

Should there be a program in maintenance/ available to squeeze it out?

Should it also be run by update.php? Or just once in a wiki's
lifetime? Or just by interested parties who feel the need?

That program would squeeze out duplicates by:
{for each page {go down its list of revisions, making duplicate
pointers point to their first occurrence}}, then run purgeOldText.php.
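
The per-page squeeze pass just described can be sketched like this,
again over dict stand-ins for the tables (the function name and data
shapes are illustrative only; the attached script works on the live
database instead):

```python
def squeeze_page(revisions, text_table):
    """revisions: a page's revisions, oldest first, each a dict with a
    'text_id' key. text_table: dict of text_id -> stored text.
    Repoint every revision whose text duplicates an earlier one at the
    first text row holding that content; return the ids this page no
    longer references."""
    first_seen = {}   # text content -> first text_id that stored it
    orphaned = set()
    for rev in revisions:
        content = text_table[rev["text_id"]]
        if content in first_seen and first_seen[content] != rev["text_id"]:
            orphaned.add(rev["text_id"])
            rev["text_id"] = first_seen[content]  # point at the first copy
        else:
            first_seen.setdefault(content, rev["text_id"])
    # Rows in `orphaned` may still be referenced by other pages, so a
    # real run would leave the actual deletion to purgeOldText.php.
    return orphaned
```

Actually deleting the orphaned rows is deliberately left to a separate
purge step, matching the two-phase approach described above.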

One needn't go to "SHA1 mapping to unbloat the text table" (
http://lists.wikimedia.org/pipermail/wikitech-l/2009-March/042373.html
) extremes.

However, perhaps we needn't restrict our thinking to a per article
paradigm, but instead just consider the whole revisions->text table
mapping. Maybe that would be a simpler and smarter way to do this.
We would thus only involve two tables... (wait, we must consider all
tables that have any mapping to the text table! Also, all this must
probably be done with the wiki locked, though it would probably take
only a few seconds for a small wiki.)

E.g., running our shell/perl scripts above we find 279 separate
pointers to blank (0 byte, vandalism) article revisions. These could
all be made to point to a single text row, even though they are not of
the same article.
Comment 2 Dan Jacobson 2009-04-06 22:42:00 UTC
Created attachment 5997 [details]
script to squeeze text table and reset pointers to it

The neat thing is that for many of our checks we needn't go beyond the
revision table, as the size of the text is stored there. So if the
size does not match, there is no need to haul the text out of the text
table for a string compare.
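
That size pre-filter can be illustrated as a grouping step (field names
and shapes are stand-ins, not the real schema): group revisions by the
length stored in the revision table, and only the groups where lengths
collide ever need their text fetched for a byte-for-byte compare.

```python
from collections import defaultdict

def candidate_groups(revs):
    """revs: iterable of (rev_id, rev_len) pairs.
    Return only the groups of revision ids whose stored lengths
    collide, i.e. the only rows for which an actual text comparison
    is still needed."""
    by_len = defaultdict(list)
    for rev_id, rev_len in revs:
        by_len[rev_len].append(rev_id)
    # Singleton groups can't contain duplicates, so drop them.
    return {n: ids for n, ids in by_len.items() if len(ids) > 1}
```

Every revision whose length is unique is ruled out without touching the
text table at all.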

OK, I have achieved squeezing _current_ duplicates out of the text
table and resetting their pointers! See attached squeeze_example.txt.

One might squeeze out current duplicates, then sit back and see where
any new ones are arriving from, and then address each case...
Comment 3 Dan Jacobson 2009-04-14 19:13:49 UTC
Thread reference: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/42847
Comment 4 Andre Klapper 2014-10-27 14:00:53 UTC
(In reply to Dan Jacobson from comment #2)
> Created attachment 5997 [details]
> script to squeeze text table and reset pointers to it

Dan: If you're still interested in getting this into MediaWiki, would you be interested in turning this into a patch in Gerrit? See https://www.mediawiki.org/wiki/Gerrit/Tutorial and https://www.mediawiki.org/wiki/Developer_access
Comment 5 Dan Jacobson 2014-10-27 18:50:12 UTC
Gee that was five years ago.
Looking back boy I must have been a wiz kid.
But now I am not that smart.
So I'll just leave whatever it is up to you guys.
