Last modified: 2014-04-15 19:13:33 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T20104, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 18104 - Schema request change so deleted edits are identified by revisionID not timestamp (prevents DIFFs from breaking)
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Page deletion
Version: unspecified
Hardware: All All
Importance: Low enhancement with 3 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: schema-change
Duplicates: 23695
Depends on:
Blocks: 18493 revdel 28047 37465 SWMT 23489 26122
Reported: 2009-03-22 13:17 UTC by FT2
Modified: 2014-04-15 19:13 UTC (History)
14 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description FT2 2009-03-22 13:17:46 UTC
At present a revision is usually identified by a revisionID in most functions. Thus a public and visible revision, a suppressed (RevisionDeleted) revision, and an oversighted revision, all are encoded and identified by the revision ID.

Normal deleted revisions are the sole exception - they are identified by a timestamp. This has two problems:

1/ Timestamp (to the precision used to identify a deleted revision: YYYYMMDDHHMMSS) may in some cases not be unique.

2/ Deleted diffs and revisions can't be identified from any link made prior to deletion, since upon deletion they switch from being identified by revision ID to being identified by timestamp. This prevents easy lookup of a diff (eg when a privacy issue or dispute arises): admins, checkusers and others cannot use a diff to check up on a matter if one of the revisions in the diff has been deleted. It also seems to prevent diffs working at all, other than in the simple case of comparing two sequential revisions.


DESIRED CHANGE

1/ All revisions, deleted or visible, to be identified by their revision ID, not timestamp or other identifier. (This may involve a schema change to deleted revisions handling.)

2/ A user who can see the text of both revisions in a diff (eg specified and next, specified and prev, or 2 specified revisions), is always able to view the diff between them; the fact one or both may have been deleted doesn't break this functionality.
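
To illustrate problem 1/ above, here is a minimal Python sketch. The revision IDs and save times are made up for the example, and mw_timestamp is a hypothetical helper, not a MediaWiki function. It shows two deleted revisions that stay distinct by revision ID but collapse to the same key once only a YYYYMMDDHHMMSS timestamp identifies them.

```python
from collections import Counter
from datetime import datetime, timezone


def mw_timestamp(dt):
    """Format a datetime as a 14-digit YYYYMMDDHHMMSS string (second precision)."""
    return dt.strftime("%Y%m%d%H%M%S")


# Hypothetical deleted revisions of one page: (rev_id, time the edit was saved).
deleted = [
    (101, datetime(2009, 3, 22, 13, 17, 46, 120000, tzinfo=timezone.utc)),
    (102, datetime(2009, 3, 22, 13, 17, 46, 870000, tzinfo=timezone.utc)),
    (103, datetime(2009, 3, 22, 13, 17, 47, 5000, tzinfo=timezone.utc)),
]

# Count how many deleted revisions share each second-precision timestamp.
by_timestamp = Counter(mw_timestamp(dt) for _, dt in deleted)
ambiguous = sorted(ts for ts, n in by_timestamp.items() if n > 1)

# The revision IDs, by contrast, are all distinct.
rev_ids = [rev_id for rev_id, _ in deleted]
```

Here revisions 101 and 102 were saved within the same second, so a page-plus-timestamp link cannot tell them apart, while a link by revision ID still can.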
Comment 1 FT2 2009-03-22 13:21:02 UTC
(Note: This is somewhat related to, but not the same as, bug 18068.

They overlap in that in each case the request involves the ability to view deleted diffs, but in this case "traditional deletion" also identifies the diffs in a manner that breaks linkage from diff (revisionID) to deleted diff, and makes it very hard to compare two arbitrary revisions if one or both are "traditionally deleted".)
Comment 2 Aaron Schulz 2009-03-22 13:47:13 UTC
Note that many older revs have a NULL ar_rev_id value
Comment 3 FT2 2009-10-25 15:47:55 UTC
See also bug 21279.
Comment 4 Graham87 2009-10-28 05:53:03 UTC
More specifically, any revs in the archive table that were deleted before Wikipedia was upgraded to MediaWiki 1.5 in June 2005, and which were never undeleted after the upgrade, will have a NULL ar_rev_id value.
Comment 5 FT2 2009-10-28 10:03:49 UTC
Presumably that can be fixed by undelete and redelete then? Or a once-off script that fills that field?
Comment 6 Graham87 2009-10-28 10:21:57 UTC
Yeah, probably. If this is done, they'll most likely get a high rev ID as if they're a new edit ... about 323,000,000 or so. It would be nice if they could get their old rev_ID's from when they weren't deleted, but I don't think that's possible.
Comment 7 FT2 2010-02-05 09:42:08 UTC
OverlordQ and I took a look at this on the toolserver. Some of this may be obvious or well known - I don't know how much MediaWiki stuff from 2005 would be "common knowledge".

The latest enwiki deleted revision with no rev_id is timestamped 20050627053602 (June 27 2005, 5.36 am) as Aaron and Graham say. 511728 deleted revisions have no rev_id. 

(Around 2356 revisions also have an entry with the same rev_id in both current and deleted revisions tables. This is presumably due to old data slippage.)

Deleted revisions from before June 2005 which were restored apparently got allocated a new rev_id. Eg, compare the dates for enwiki revision id's 15700000 (June 14 2005), 15700001 (June 9 2005), 16300000 (May 1 2005), and 17000001 (Dec 8 2004). It doesn't seem to cause problems though.

There appear to have been about 17,739,500 revisions on enwiki prior to the changeover of June 2005 (roughly 17.74 m). Because rev_ids were reallocated, you have to go back and forth by 50 or 100 at a time to get an idea of what rev_id was reached at roughly what time.

Oversight and Developer deletions would have been negligible up to that point. So in principle, there were approximately 17.7 m revisions prior to the changeover of which 17,043,322 can be traced to "Live data", leaving 696k revision ids untraced.

The conclusion is that the 511 k of old deleted revisions with rev_id = NULL can be sequenced into the 17.7 m known rev_ids prior to the changeover, and there are 696 k rev_ids of deleted revisions which they map into. (The explanation for the other 185 k isn't clear. Delete/restore activity on old revisions??) 
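
The arithmetic behind these rounded figures can be checked with a few lines of Python. The three input numbers are the ones quoted in this comment; the variable names are only for illustration.

```python
# Figures quoted in this comment (enwiki, changeover of June 2005).
total_before_changeover = 17_739_500  # approximate rev_ids issued by then
traced_to_live = 17_043_322           # rev_ids traceable to "Live data"
null_rev_id_deleted = 511_728         # deleted revisions with no rev_id

# rev_ids issued before the changeover that are not in the live data.
untraced = total_before_changeover - traced_to_live

# Untraced rev_ids left over once every NULL-id deleted revision has a slot.
unexplained = untraced - null_rev_id_deleted
```

This gives 696,178 untraced rev_ids and 184,450 left unexplained, which is where the rounded "696 k" and "185 k" come from.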

It looks like all the deleted revisions with a null value can be matched fairly accurately by time order against existing gaps in the current revisions and assigned a suitable rev_id that's currently not taken. It might not be perfect but it'll be close, and allocating a time-sequenced old rev_id is probably helpful for admins and the like.

Deletions are quite interspersed with undeleted revisions so this isn't a job requiring human guesswork. It could be a once-off task suited to a script.

This would at least mean every deleted revision had a rev_id, which is a first step in fixing the problems.
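
A rough Python sketch of what such a once-off script might do is below. It is only an illustration under stated assumptions: assign_rev_ids, its arguments and the sample data are hypothetical, timestamps are 14-digit strings, and the matching rule (take the first unused rev_id above the nearest earlier live revision) is one plausible reading of "matched by time order against existing gaps", not a tested procedure.

```python
import bisect


def assign_rev_ids(live, null_deleted, max_rev_id):
    """Pair deleted revisions that lack a rev_id with unused rev_ids, in time order.

    live:         dict mapping rev_id -> timestamp for rev_ids already in use
    null_deleted: list of timestamps of deleted revisions with no rev_id
    max_rev_id:   highest rev_id issued before the changeover
    Returns a list of (timestamp, assigned_rev_id) pairs.
    """
    free = [i for i in range(1, max_rev_id + 1) if i not in live]
    live_by_time = sorted((ts, rev_id) for rev_id, ts in live.items())
    live_times = [ts for ts, _ in live_by_time]

    assigned = []
    floor = 0  # index of the first free rev_id not yet handed out
    for ts in sorted(null_deleted):
        # rev_id of the latest live revision saved at or before this one.
        pos = bisect.bisect_right(live_times, ts)
        prev_id = live_by_time[pos - 1][1] if pos else 0
        # First free rev_id above that neighbour that is still available.
        idx = max(floor, bisect.bisect_right(free, prev_id))
        if idx >= len(free):
            break  # no gap left; remaining revisions stay unassigned
        assigned.append((ts, free[idx]))
        floor = idx + 1
    return assigned


# Hypothetical data: rev_ids 3 and 4 are unused gaps in the live table.
live = {
    1: "20050101000000",
    2: "20050101000100",
    5: "20050101000400",
    6: "20050101000500",
}
result = assign_rev_ids(live, ["20050101000300", "20050101000200"], 6)
```

With this sample data the two deleted revisions land in gaps 3 and 4, in timestamp order. Real data would be messier, since live rev_ids are not always in time order, as noted above.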
Comment 8 Graham87 2010-02-05 16:04:24 UTC
The 185 k leftover revision ID's are probably due to the fact that the deletion archive was cleared twice, in June 2004 and December 2003. It was created in August 2002, so a negligible number of edits were permanently deleted. See:
http://en.wikipedia.org/wiki/Wikipedia:Viewing_and_restoring_deleted_pages#Deletion_archive

The numbers surprise me a bit. They seem to indicate that about three times as many edits were deleted between June 2004 and June 2005 as in the entire period up to June 2004. Perhaps revision ID numbers were reset or reused at some point; it'd be best to ask Brion or Tim about this. The numbers are particularly surprising because those old deleted revisions would presumably include edits to the Wikipedia sandbox, which was routinely moved to bizarre titles by newbies or outright deleted. A page move of a page with many revisions in MW 1.4 and below was a much bigger disaster than it is now, and move protection wasn't introduced until December 2004, see:
http://en.wikipedia.org/wiki/Wikipedia_talk:Protected_page/Archive2#Protection_against_page_moves

I once rescued some old sandbox revisions that go back to June 2004; the ones before then were deleted and are now irretrievable. I used to believe that there were about 50,000 irretrievable sandbox revisions, but with the numbers you have presented here, that may be an exaggeration. For the old sandbox edits, see:
http://en.wikipedia.org/wiki/Wikipedia:Historical_archive/Sandbox

Can I have an example of some revisions having the same ID in the deletion and regular tables? That is very bizarre.

Out-of-order revision IDs cause problems for diffs; the number of intermediate revisions is misreported (as that function works by rev ID), and since the prev/next links in diffs are also ordered by rev ID, they are also affected. See this diff as an example:
http://en.wikipedia.org/w/index.php?title=Talk:Netherlands&diff=11427366&oldid=229588957
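
The miscount can be reproduced with a toy history in Python. The data below is invented, and the two helper functions are hypothetical stand-ins for "count by rev ID" and "count by date", not MediaWiki's actual code.

```python
# Hypothetical page history as (rev_id, timestamp) pairs, listed in date order.
# Revision 900 is an old edit that received a new, high rev_id on restoration.
history = [
    (10, "20040101000000"),
    (900, "20040201000000"),
    (20, "20040301000000"),
    (30, "20040401000000"),
    (40, "20040501000000"),
]


def intermediate_by_id(history, old_id, new_id):
    """Count revisions whose rev_id lies strictly between the two IDs."""
    lo, hi = sorted((old_id, new_id))
    return sum(1 for rev_id, _ in history if lo < rev_id < hi)


def intermediate_by_time(history, old_id, new_id):
    """Count revisions whose timestamp lies strictly between the two revisions."""
    timestamps = dict(history)
    lo, hi = sorted((timestamps[old_id], timestamps[new_id]))
    return sum(1 for _, ts in history if lo < ts < hi)


counted_by_id = intermediate_by_id(history, 10, 40)
counted_by_time = intermediate_by_time(history, 10, 40)
```

For the diff between revisions 10 and 40, counting by ID finds two intermediate revisions, while three edits were actually made between those dates; revision 900 is the one that gets missed.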

More rarely, revisions have the opposite problem; they have the correct revision ID but the wrong date. See bug 2219.
Comment 9 OverlordQ 2010-02-05 16:24:44 UTC
I've opened bug 22392 for the duplicate revision IDs. I put it in Wikimedia as I wasn't 100% sure of the product.
Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-02-06 23:55:27 UTC
(In reply to comment #7)
> (The
> explanation for the other 185 k isn't clear. Delete/restore activity on old
> revisions??) 

AUTOINCREMENT columns aren't guaranteed to be allocated sequentially; values can be skipped.  In particular, if a transaction inserts a row, then gets rolled back rather than committed, there will be a gap, since the id is assigned at insert time and not at commit time.  Autoincrementing is visible immediately, even across transaction boundaries -- which would be a violation of transactional semantics if id's were guaranteed to be sequential, but they aren't.

(This probably doesn't account for *that* many missing revisions, though.)
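
A toy Python model of that behaviour follows. ToyTable is a made-up class that only mimics a counter advancing at insert time; it is not how any particular database engine is implemented.

```python
class ToyTable:
    """Minimal model of a table whose ID counter sits outside the transaction."""

    def __init__(self):
        self.next_id = 1
        self.rows = {}     # committed rows: id -> value
        self.pending = {}  # rows inserted in the current transaction

    def insert(self, value):
        row_id = self.next_id
        self.next_id += 1  # the counter advances immediately, at insert time
        self.pending[row_id] = value
        return row_id

    def commit(self):
        self.rows.update(self.pending)
        self.pending.clear()

    def rollback(self):
        # Pending rows are discarded, but the counter is NOT wound back.
        self.pending.clear()


table = ToyTable()
table.insert("edit A")
table.commit()
table.insert("edit B")
table.rollback()  # id 2 has been consumed and is never reused
table.insert("edit C")
table.commit()
committed_ids = sorted(table.rows)
```

After the rolled-back insert, the committed IDs are 1 and 3, leaving a permanent gap at 2.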
Comment 11 John Mark Vandenberg 2010-07-09 00:17:27 UTC
*** Bug 23695 has been marked as a duplicate of this bug. ***
Comment 12 FT2 2010-07-09 04:02:57 UTC
As an interim option, can we at least have "&oldid=" work with Special:Undelete?

Most deleted revisions have a revision id and the field is indexed. The few old deleted revisions that don't have a revision id can easily be given one.

Revid/oldid is used everywhere else to identify a revision; deleted revisions are the only exception. It would be a fairly simple change to have MediaWiki correctly handle links of the form:
http://en.wikipedia.org/w/index.php?title=Special:Undelete&oldid=12345

as an equally valid alternative to the existing:
http://en.wikipedia.org/w/index.php?title=Special:Undelete&target=PAGENAME&timestamp=TIMESTAMP

Advantages:

1/ Whether a revision is deleted or undeleted, the revision ID stays the same, so admins can still use it to find the revision;

2/ Allows a fallback to be added to MediaWiki: if &oldid= is given in a link and the revision isn't in the current revisions table, it can easily and automatically be searched for in the deleted revisions table instead (and vice versa), so diff links will "dead end" less often;

3/ It's simple to do and probably trivial in effort (if the Special:Undelete args include a page+timestamp, look up the data by those; if the args are an oldid, look up the data by that in the same table);

4/ Makes it easier to transition in future to using rev_id as the common index field for revisions whether deleted or visible, which is a simplifying direction for MediaWiki. Pagename/timestamp would continue to work so nothing "breaks", but allowing oldid to work will make a future transition easier.
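
The fallback in point 2/ could look roughly like the Python sketch below, using an in-memory SQLite database as a stand-in. The tables are cut down to a few columns for illustration, and find_revision, make_demo_db and the sample rows are hypothetical, not part of MediaWiki.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the revision and archive tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_timestamp TEXT);
        CREATE TABLE archive (ar_rev_id INTEGER, ar_title TEXT, ar_timestamp TEXT);
        INSERT INTO revision VALUES (100, '20090101000000');
        INSERT INTO archive VALUES (12345, 'PAGENAME', '20090322131746');
        INSERT INTO archive VALUES (NULL, 'PAGENAME', '20050101000000');
        """
    )
    return conn


def find_revision(conn, oldid):
    """Look up a revision by ID in the live table, then in the deleted table."""
    row = conn.execute(
        "SELECT rev_id, rev_timestamp FROM revision WHERE rev_id = ?", (oldid,)
    ).fetchone()
    if row:
        return ("live", row[0], row[1])
    # Not a live revision: fall back to the deleted-revisions table.
    row = conn.execute(
        "SELECT ar_rev_id, ar_timestamp FROM archive WHERE ar_rev_id = ?", (oldid,)
    ).fetchone()
    if row:
        return ("deleted", row[0], row[1])
    return None


conn = make_demo_db()
live_hit = find_revision(conn, 100)
deleted_hit = find_revision(conn, 12345)
missing = find_revision(conn, 999)
```

A real implementation would also need to check the viewer's permissions before showing anything from the deleted table. The archive row with a NULL ar_rev_id can never be found this way, which is why filling in those IDs matters.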
