Last modified: 2014-01-31 21:10:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T24624, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 22624 - Corruption of archive text due to deletion in late 2004
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: History/Diffs
Version: 1.4.x
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2010-02-23 06:36 UTC by Tim Starling
Modified: 2014-01-31 21:10 UTC (History)

Description Tim Starling 2010-02-23 06:36:26 UTC
This is a bug I'm tracking down and fixing; I'm putting it here so that I have a place for notes and something to refer to.

CGZ compression was first committed in October 2004, r5940. In December 2004, r6640, this bug was discovered and a temporary fix put in place. Apparently nobody submitted it to Bugzilla at the time.

The issue was that the deletion UI was blind to the compression scheme, and was causing CGZ blobs and pointers to be moved into the archive table. Undeletion would move them back. Pointers to deleted rows cannot work and will give you an error message, so the text of these pointers is unreadable. If the whole article was undeleted, the CGZ blob would get a different old_id, which means that the pointers still don't work. 

If the article was partially undeleted, then you could have pointers which point to deleted rows.
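
The failure mode described above can be illustrated with a toy model. This is only a sketch, not MediaWiki code: it assumes a pointer records the old_id of its CGZ blob plus a content hash (MD5 is used here purely for illustration), and the names `Pointer`, `resolve` and `delete_and_undelete` are hypothetical.

```python
import hashlib
from typing import Dict, NamedTuple, Optional

# Toy model: the text table maps old_id -> blob, and a blob maps
# content hash -> revision text. All names here are hypothetical.
Blob = Dict[str, str]


class Pointer(NamedTuple):
    blob_old_id: int    # old_id of the row assumed to hold the CGZ blob
    content_hash: str   # hash identifying one item inside that blob


def content_hash(text: str) -> str:
    # MD5 is an assumption made for this illustration only.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def resolve(pointer: Pointer, text_table: Dict[int, Blob]) -> Optional[str]:
    """Return the text a pointer refers to, or None if the pointer is broken."""
    blob = text_table.get(pointer.blob_old_id)
    if blob is None:
        return None
    return blob.get(pointer.content_hash)


def delete_and_undelete(text_table: Dict[int, Blob], old_id: int, new_old_id: int) -> None:
    """Model a full delete/undelete cycle: the blob row returns under a new old_id."""
    blob = text_table.pop(old_id)
    text_table[new_old_id] = blob


if __name__ == "__main__":
    table = {10: {content_hash("rev A"): "rev A", content_hash("rev B"): "rev B"}}
    ptr = Pointer(10, content_hash("rev B"))
    print(resolve(ptr, table))      # the pointer works before deletion
    delete_and_undelete(table, 10, 57)
    print(resolve(ptr, table))      # the pointer still names old_id 10, so it is broken
```

The text itself survives inside the blob after the cycle; only the old_id recorded in the pointer is stale, which is why such revisions are inaccessible but recoverable.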

However, undeleted CGZ rows would still give you their default text, which left them open to subsequent irreversible corruption by recompressTracked.php, which may have deleted some of these CGZ blobs, replacing them with a pointer to the primary text only.

The subsequent fixes (r6640, r8983) only fixed the text corruption at the source (i.e. deletion). Apparently no script was run to fix corrupted archive rows or undeleted text rows.

Some archive rows even have pointers to external storage, apparently moved in from old/text via the same bug.

The reason this is coming up now is that there are a fair few revisions which are either accessible (CGZ default text), or inaccessible but recoverable (CGZ pointers), which are now at risk of being lost permanently due to recompressTracked.php. 

The basic plan of action is to compile a list of content hashes in affected CGZ blobs, and to match them up with broken pointers by comparing those content hashes.
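
A minimal sketch of that matching step might look like the following. It assumes the affected blobs have already been read into a mapping from row ID to the hashes they contain, and that each broken pointer is known by its own row ID and the hash it refers to; `index_blob_hashes` and `match_pointers` are hypothetical names, not existing maintenance code.

```python
from typing import Dict, List, Tuple


def index_blob_hashes(blobs: Dict[int, Dict[str, str]]) -> Dict[str, List[int]]:
    """Build a reverse index: content hash -> sorted IDs of the blobs containing it."""
    index: Dict[str, List[int]] = {}
    for blob_id in sorted(blobs):
        for item_hash in blobs[blob_id]:
            index.setdefault(item_hash, []).append(blob_id)
    return index


def match_pointers(
    broken: Dict[int, str], index: Dict[str, List[int]]
) -> Tuple[Dict[int, int], List[int]]:
    """Match each broken pointer (row ID -> wanted hash) to a blob holding that hash.

    Returns (repaired, unresolved): repaired maps pointer row ID -> blob ID,
    unresolved lists pointer row IDs whose hash was found in no blob.
    """
    repaired: Dict[int, int] = {}
    unresolved: List[int] = []
    for pointer_id in sorted(broken):
        candidates = index.get(broken[pointer_id])
        if candidates:
            # Any blob holding the same hash holds the same text; take the lowest ID.
            repaired[pointer_id] = candidates[0]
        else:
            unresolved.append(pointer_id)
    return repaired, unresolved


if __name__ == "__main__":
    blobs = {57: {"h1": "text one", "h2": "text two"}, 80: {"h3": "text three"}}
    broken = {101: "h2", 102: "h3", 103: "hX"}
    print(match_pointers(broken, index_blob_hashes(blobs)))
```

Pointers left in the unresolved list would need manual investigation, since their text may already have been lost.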

I may be able to take this opportunity to normalise the entire archive table, by converting archive rows to the MW 1.5+ format, with a non-null ar_text_id, and blank ar_text and ar_flags. This will free up core database space and allow the deleted text to be recompressed.
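
The normalisation could be sketched roughly as below, using plain dictionaries in place of database rows. The column names ar_text, ar_flags and ar_text_id come from the description above; old_text and old_flags, the function name, and the ID allocation are assumptions made for illustration. The sketch also ignores the corrupted pointer rows, which would need repairing first.

```python
from typing import Dict, List, Optional, Union

ArchiveRow = Dict[str, Union[str, Optional[int]]]


def normalise_archive_rows(
    rows: List[ArchiveRow],
    text_table: Dict[int, Dict[str, str]],
    next_old_id: int,
) -> int:
    """Move inline archive text into the text table, in place.

    Rows that already have a non-null ar_text_id are left untouched.
    Returns the next unused old_id.
    """
    for row in rows:
        if row["ar_text_id"] is not None:
            continue  # already in the newer format
        # Copy the inline text and its flags into a new text row.
        text_table[next_old_id] = {
            "old_text": str(row["ar_text"]),
            "old_flags": str(row["ar_flags"]),
        }
        # Point the archive row at the new text row and blank the inline copy.
        row["ar_text_id"] = next_old_id
        row["ar_text"] = ""
        row["ar_flags"] = ""
        next_old_id += 1
    return next_old_id


if __name__ == "__main__":
    text_table: Dict[int, Dict[str, str]] = {}
    rows: List[ArchiveRow] = [
        {"ar_text": "old text", "ar_flags": "gzip", "ar_text_id": None},
        {"ar_text": "", "ar_flags": "", "ar_text_id": 5},
    ]
    print(normalise_archive_rows(rows, text_table, 100), rows, text_table)
```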
Comment 1 Platonides 2010-03-18 23:13:53 UTC
(copying from wikitech)
> I may be able to take this opportunity to normalise the entire archive table,
> by converting archive rows to the MW 1.5+ format, with a non-null ar_text_id,
> and blank ar_text and ar_flags. This will free up core database space and allow
> the deleted text to be recompressed.


What are the chances of moving them back to the revision table and using revdelete for all deletions (removing the archive table)?
See bug 18104, bug 21279, bug 18780
Comment 2 Kevin Israel (PleaseStand) 2014-01-30 05:29:19 UTC
trackBlobs.php refers to a normaliseArchiveTable.php script, which I could not find in core or in WikimediaMaintenance. Has this script not yet been written?
Comment 3 Pawan Seerwani 2014-01-31 20:13:31 UTC
Hi,
I am working on a related issue, i.e. bug 34925.

All I understand is some data is already corrupted in archives tables and before solving Bug 34925, this bug is to be solved.

So I might as well solve this bug first.

But in my repository( which is wikimedia 1.23), none of the following files exist in wikimedia/maintenance folder

1. recompressTracked.php
2. trackBlobs.php
3. normaliseArchiveTable.php

So can someone tell me how do I solve this bug?
Comment 4 Kevin Israel (PleaseStand) 2014-01-31 20:40:39 UTC
(In reply to comment #3)
> All I understand is some data is already corrupted in archives tables and
> before solving Bug 34925, this bug is to be solved.

As stated in comment 0 ("description"), it would perhaps be most efficient to clean up the database corruption while normalizing the archive table, because doing so requires copying data into the text table anyway (and possibly into external storage, if that is where the text should end up).

> So I might as well solve this bug first.
> 
> But in my repository( which is wikimedia 1.23), none of the following files
> exist in wikimedia/maintenance folder
> 
> 1. recompressTracked.php
> 2. trackBlobs.php
> 3. normaliseArchiveTable.php

The first two exist in a subfolder of maintenance -- maintenance/storage. The third is the maintenance script you were trying to write ("textMigration.php"), though with a command-line option for fixing this bug.

> So can someone tell me how do I solve this bug?

This isn't a particularly easy bug to fix.

MediaWiki's text storage subsystem is poorly documented, and there have been various bugs over the years (including this one!) that need to be accounted for.

Testing is a bit tricky. You would have to set up PHP4 in order to install a buggy revision of MediaWiki 1.4 (which is not compatible with PHP5). You would have to create, edit, and delete some pages, and run a
Comment 5 Kevin Israel (PleaseStand) 2014-01-31 21:10:20 UTC
> create, edit, and delete some pages, and run a

[Sorry, I accidentally hit "Save Changes" before I was done typing my comment. It continues as:]

specific maintenance script (compressOld.php) prior to page deletion.
Then you would have to switch to PHP5 and upgrade the installation to the master version of MediaWiki.

What the maintenance script will have to do is already stated in comment 0 ("The basic plan of action is to compile a list of content hashes in affected CGZ blobs, and to match them up with broken pointers by comparing those content hashes.")

I might as well assign bug 34925 to myself, as I have already spent the several hours necessary to understand how MediaWiki does text storage, and the code I have looks more complete than what has already been posted on Gerrit.

The only reason I didn't already do so was the possibility that Tim Starling might already have this mostly done ("This is a bug I'm tracking down and fixing"). However, if you manage to get the script done before Tim Starling or I do, I would love to take a look at it again.
