Last modified: 2014-11-17 10:35:06 UTC
http://fr.wikipedia.org/w/index.php?title=Neil_Armstrong&action=history shows two manual (script-assisted) reverts to the same edit, in revisions 79252172 and 79252174. This shouldn't be possible: either Le pro du 94 :) should have got an edit conflict, or the edit should have been silently skipped without getting a history entry. This could be something serious (such as ChronologyProtector not working, or the code no longer being in a transaction) or trivial (such as the js script breaking the edit atomicity). rev_id 79252174 by Le pro du 94 :) has a rev_parent_id of 79252172 (Superjuju10's edit), which suggests a bug in our code. The text of 79252172 and 79252174 is equal. After-the-fact editing of one doesn't override the other (as expected). Both show a rev_sha1 of o2dxb378cwej8zv7lx5y1p0wk87oduo.
I don't think ChronologyProtector is involved here or that the script[1] could cause it by itself. We should dig into EditPage/Article for this race condition. 1- http://fr.wikipedia.org/wiki/MediaWiki:Gadget-LiveRC.js
We seem to have a good number of these double-entries, usually by the same user (even by bots!). Find a sample below: https://en.wikipedia.org/w/index.php?diff=495195923&oldid=495195868 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495187022 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495187432 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188594 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188774 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188789 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188800 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495190963 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495186485 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495185912 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495183490 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182487 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182476 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182292 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182160 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182128 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181345 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181176 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181078 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181009 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495180628 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495179602 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495176201 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495176177 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495173896 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495172714 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495171495 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495171337 https://en.wikipedia.org/w/index.php?diff=prev&oldid=495171041 
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495170956
You can't get an edit conflict with yourself, the subsequent edit just overwrites the previous edit. This allows the user to click "back" after a save to return to the edit form, in order to amend an edit.
Fair, but you can't create two edits with the same content either (ignoring log actions such as moves, protections, etc.). I tried to reproduce the race with maintenance/edit.php but had no luck so far. A lagged slave might be needed.
This appears to be due to r79518, when essential pre-save consistency checks were modified to use the slave DB instead of the master.
I see the line: $this->loadPageData( 'fromdbmaster' ); ...before the text is loaded in WikiPage::doEdit(). It will use the page_latest field from the master to get the revision/text. That text is compared to the incoming text to see if anything changed. In fact, wrapped outside of this, EditPage has a loadPageData( 'forupdate' ) call in internalAttemptSave(). It doesn't seem like it should be easy for this to happen. There is a "$dbw->begin( __METHOD__ );" right after $changed is checked in doEdit(). This will release the FOR UPDATE lock. In theory, another process can save an edit after that and cause a race.
Those transaction-breaking begin()s... But I don't think it matters. You enter updateRevisionOn(), which is passed $oldid and uses it as WHERE page_latest=$oldid, so if another process saved an edit in the middle, updateRevisionOn() returns false and it'd roll back.
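That page_latest guard is effectively a compare-and-swap. A toy sketch of it (Python/SQLite standing in for the actual WikiPage/MySQL code; names are illustrative):

```python
import sqlite3

def update_revision_on(db, page_id, new_rev_id, old_rev_id):
    # Models "UPDATE page SET page_latest=<new> WHERE page_latest=<old>":
    # zero affected rows means another writer raced past old_rev_id,
    # so the caller should roll back instead of committing a duplicate.
    cur = db.execute(
        "UPDATE page SET page_latest = ? WHERE page_id = ? AND page_latest = ?",
        (new_rev_id, page_id, old_rev_id),
    )
    return cur.rowcount == 1

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER)")
db.execute("INSERT INTO page VALUES (1, 5)")
```

Note the limitation: the guard only catches races when old_rev_id comes from a stale read. If old_rev_id is already the true latest revision (but the text compared against it was stale), the UPDATE succeeds and nothing is detected.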
Ongoing issue since 2012-05-30 at http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Diff.2Fwatchlist_difficulties (and other sections on the same page). Related to bug 36566 ?
I am also seeing this at Wikiquote. The spurious edits are reported on Recent Changes with a size of +<article size>. I vote to prioritize fixing this because I am a regular Recent Changes watcher and the error gives the strong but misleading appearance of pattern vandalism.
This still seems to be a frequent occurrence, given the continuing reports at enwiki:VPT. Would a database query help identify the precise time on 2012-05-30 that these duplicate entries (which look like saved null edits) began to appear? The time could then be compared with the recent server admin activity at http://wikitech.wikimedia.org/view/Server_admin_log#May_30
It looks like WikiPage::loadPageData(), which calls loadFromRow(), does not clear mLastRevision, which includes a process cache of the revision text. So, AFAIK, if I do something like:
----
$wp = WikiPage::factory( $title );
$text1 = $wp->getRawText(); // loads text from slave
...wait...
$wp->loadPageData( 'fromdbmaster' ); // loads `page` row from master
$text2 = $wp->getRawText(); // returns same value cached in mLastRevision
----
$text2 will equal $text1 even if an edit was done in the meantime. WikiPage::doEdit() relies on this not being the case. Otherwise, "$oldid = $this->getLatest();" will get the true latest rev ID but "$oldtext = $this->getRawText();" will be stale, so $changed could be erroneously true, causing a new revision insertion. The updateRevisionOn() race check will not detect any races because $oldid *is* the latest and not some stale value.
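A minimal Python model of that staleness (illustrative names, not the real WikiPage API):

```python
class PageModel:
    # Toy stand-in for WikiPage: `store` maps rev_id -> text and plays the
    # role of the database; `_cached_text` plays the role of mLastRevision.
    def __init__(self, store):
        self.store = store
        self.latest = max(store)          # page_latest
        self._cached_text = None

    def get_raw_text(self):
        if self._cached_text is None:
            self._cached_text = self.store[self.latest]
        return self._cached_text

    def load_page_data(self, clear_cache=False):
        self.latest = max(self.store)     # re-read page_latest "from master"
        if clear_cache:
            self._cached_text = None      # the step loadFromRow() is missing

store = {1: "old text"}
wp = PageModel(store)
wp.get_raw_text()      # caches rev 1's text
store[2] = "new text"  # a concurrent edit lands
wp.load_page_data()    # buggy reload: page_latest advances, cache survives
```

At this point wp.latest is 2, yet wp.get_raw_text() still returns rev 1's text, so comparing it against the incoming text reports a change even when there is none; passing clear_cache=True models the fix.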
*** Bug 36566 has been marked as a duplicate of this bug. ***
(In reply to comment #5)
> This appears to be due to r79518, when essential pre-save consistency checks
> were modified to use the slave DB instead of the master.
r79518 was present in 1.18, and reports started on 30 May, which points to the 1.20wmf4 deployment on 28 May. From the common merge base of 1.20wmf3 and 1.20wmf4 (50927) to 1.20wmf4 (5e5773), git log 50927..origin/wmf/1.20wmf4 lists 248 commits to review. Filtering to changes to Article.php / WikiPage.php shows only the "Generalizing LinksUpdate" changes (a12ce17, fa132be) as a candidate. Still, I don't see anything that stands out as wrong there. They are all changes to link updates done *after* an edit.
As a side note: I think it would be helpful to put all database actions needed when saving a page into SqlUpdate objects, and then perform all updates using a single call to DataUpdate::runUpdates(). This way, all updates are performed in a single transaction, and it would be much easier to understand which updates are needed in which order.
It's worth noting that one of the symptoms of the problem is that rc_old_len is incorrectly filled in as 0 for these edits. If there's not going to be a full fix for preventing the edits, at least this could be resolved; rc_old_len is used for the diff sizes on watchlists and recentchanges.
(In reply to comment #14) > As a side note: I think it would be helpful to put all database actions needed > when saving a page into SqlUpdate objects, and then perform all updates using a > single call to DataUpdate::runUpdates(). This way, all updates are performed in > a single transaction, and it would be much easier to understand which updates > are needed in which order. The links updates need to be done in a separate transaction from other updates.
Priority set to highest. (This bug "affects a number of users and impedes their use of the site" on "Wikimedia's servers in production".)
(In reply to comment #11) > It looks like WikiPage::loadPageData(), which calls loadFromRow(), does not > clear mLastRevision, which includes a process cache of the revision text. So, > AFAIK, if I do something like: > ---- https://gerrit.wikimedia.org/r/#/c/10722/1 should fix this. I did some UDP logging when that staleness occurs and each of the few log entries in fact coincide with duplicate edits so far: * http://en.wikipedia.org/w/index.php?title=Assassination_of_John_F._Kennedy&diff=497445110&oldid=497445077. * http://sv.wikipedia.org/w/index.php?title=Serbiens_damlandslag_i_fotboll&diff=16784593&oldid=16784591.
Some spaces where the issue is discussed. Two consecutive days on the French village pump:
* http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Le_Bistro/5_juin_2012#Modification_vide_.3F
* http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Le_Bistro/6_juin_2012#Liste_de_suivi
Already noted on the English Wikipedia village pump: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Diff.2Fwatchlist_difficulties
Discussions also occurred a little on the #wikipedia-fr channel, and quite a bit on #wikipedia-fr-liverc (a patrolling channel). In a nutshell, they say it makes patrolling slower, as they have to check plenty of empty/duplicated edits.
Aaron's fix *should* fix this no later than the 1.20wmf5 deployment (Monday for en.wikipedia.org, and Wednesday for *.wikipedia.org). It would probably be a good idea to backport this to 1.20wmf4. Aaron is not 100% confident that the fix he made will entirely fix this, which is the point of the logging. We'll keep an eye on this.
Example null-change edit to [[Belfast]]: http://en.wikipedia.org/w/index.php?title=Belfast&diff=498122258&oldid=498122204 The above edit was made 2012-06-18T04:03:02 (UTC), duplicating the 2012-06-18T04:02:27 (+880 bytes) revert of a vandalism edit made 2012-06-18T04:01:58 (-880 bytes). The 04:03:02 edit is currently shown on a watchlist as incorrectly having +880 bytes compared with the 04:02:27 edit (even though the page text is identical). The zero byte difference is similarly shown incorrectly as +880 bytes on [[Special:RecentChangesLinked]]: http://en.wikipedia.org/w/index.php?title=Special:RecentChangesLinked&from=20120618040157&target=Talk%3ABelfast (and presumably on [[Special:RecentChanges]]). The edit is correctly shown on the revision history as having a zero byte difference: http://en.wikipedia.org/w/index.php?title=Belfast&action=history My guess is that the 1.20wmf4 rollout in the past few hours may have corrected the byte-difference in the history page but has failed to correct the differences on watchlists and recentchanges lists. Of course, the above example (being made prior to this evening's rollout) gives no indication of whether wmf4 is now successfully preventing further zero-edits!
(In reply to comment #21) > My guess is that the 1.20wmf4 rollout in the past few hours may have corrected > the byte-difference in the history page but has failed to correct the > differences on watchlists and recentchange lists. > > Of course, the above example (being made prior to this evening's rollout) gives > no indication of whether wmf4 is now successfully preventing further > zero-edits! 1.20wmf4 was rolled out to the English Wikipedia on the 4th. Today 1.20wmf5 was rolled out to that wiki. See https://www.mediawiki.org/wiki/MediaWiki_1.20/Roadmap#Schedule_for_the_deployments for more information.
(In reply to comment #22) > 1.20wmf4 was rolled out to the English Wikipedia on the 4th. Today 1.20wmf5 was > rolled out to that wiki. ... Sorry, "wmf4" was my typo (in both places), should read "wmf5". To clarify: the Belfast null-edit was made prior to today's wmf5 rollout, but the changelist byte discrepancies are as shown now (after the rollout). So, whether or not wmf5 has prevented future null-edit saves, it is not listing the byte differences correctly on change lists (other than on the history page).
The watchlist gets the length diff from the recentchanges table, so it will show the same values (correct or invalid), and the values stored in recentchanges were filled in at edit time. The release of 1.20wmf5 is expected to fix the bug, but won't fix the old wrong data. So it isn't evidence of 1.20wmf5 wrongdoing.
I can now confirm that the wmf5 rollout on enwiki has NOT prevented new duplicate entries (saved null edits). There are 4 consecutive identical edits to [[Croatia national football team]] by [[User:86.182.161.144]], at 2012-06-18T20:45:02, 20:45:48, 20:46:38 and 20:47:14 (UTC), e.g.: http://en.wikipedia.org/w/index.php?title=Croatia_national_football_team&diff=next&oldid=498236646 The first edit is correctly listed as +11 bytes. The subsequent 3 edits (identical in content to the first) are incorrectly listed at [[Special:RecentChanges]], [[Special:RecentChangesLinked]] and [[Special:Watchlist]] as each adding the entire length of the article (+112,385 bytes, listed 3 times). The byte changes are correctly listed on the history tab (as +11, 0, 0, 0, respectively). But the underlying problem is that null changes should not appear at all, so the fix did not work.
From the logging, I suspect that there is a parallel bug somewhere that also causes this problem. Only one cause has been fixed.
Even freakier, enwiki:VPT allowed a NON-null edit conflict 6 hours ago. The history tab gives the following details. First, an archive bot correctly removed several sections:
* 2012-06-19T06:40:10 (UTC) MiszaBot_II (198,571 bytes) (-10,318): http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&diff=498298525&oldid=498279415
Then, an editor's new comment was saved, but with the archived sections reinserted:
* 2012-06-19T06:40:16 Hoo_man (209,033 bytes) (+10,462): http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&diff=498298534&oldid=498298525
Proof that this is a technical inconsistency, not just a manually-overridden edit conflict:
* In Special:Watchlist and Special:RecentChangesLinked (and presumably also Special:RecentChanges), the second edit is recorded as having a delta of +209,033 bytes - in other words, it compares the new edit with zero instead of with the byte count of the previous edit.
Fortunately, the RecentChanges byte-delta bug proves that this was caused by MediaWiki and not by the editor! So the minor bug is at least evidencing the major one.
What are some examples of duplicate edits on 6/19/12 or later?
(In reply to comment #28) > What are some examples of duplicate edits on 6/19/12 or later? That ? http://fr.wikipedia.org/w/index.php?title=New_Edmonton_Arena&action=history
(In reply to comment #29) > (In reply to comment #28) > > What are some examples of duplicate edits on 6/19/12 or later? > > That ? > > http://fr.wikipedia.org/w/index.php?title=New_Edmonton_Arena&action=history fr wasn't upgraded until today, en was upgraded Monday.
(In reply to comment #30) > (In reply to comment #29) > > (In reply to comment #28) > > > What are some examples of duplicate edits on 6/19/12 or later? > > > > That ? > > > > http://fr.wikipedia.org/w/index.php?title=New_Edmonton_Arena&action=history > > fr wasn't upgraded until today, en was upgraded Monday. ok
A current example from enwiki: Duplicate edits of [[Wilhering Abbey]] by [[User:193.171.131.230]]: * 2012-06-21T12:51:53 UTC (+9 bytes) * 2012-06-21T12:51:55 UTC (+19,493 bytes) = a saved null edit http://en.wikipedia.org/w/index.php?title=Wilhering_Abbey&curid=4739545&diff=498658206&oldid=498658203
Using ddsh, I can see that all servers are using the same version of WikiPage.php, so it can't be servers with stale code.
(In reply to comment #33) > Using ddsh, I can see that all servers are using the same version of > WikiPage.php, so it can't be servers with stale code. Gah, spoke too soon. Servers 190, 206, 266, 281 are not reachable. They are all still in the apache pybal group, and thus are receiving requests. It's possible they are running stale code. Srv281 may also be causing trouble with a swift config change for the same reason.
(In reply to comment #34) > Gah, spoke too soon. Servers 190, 206, 266, 281 are not reachable. They are all > still in the apache pybal group, and thus are receiving requests. It's possible > they are running stale code. Srv281 may also be causing trouble with a swift > config change for the same reason. These have actually already been disabled and don't seem to be taking traffic from LVS according to Tim. So that looks ruled out as a cause after all.
Logging confirms a second bug that can cause this. In doEdit(), "$changed = ( strcmp( $text, $oldtext ) != 0 );" doesn't account for the rtrim() call on the text in the Revision constructor. If the pre-trimmed text is used for comparison but the trimmed text is inserted, you could have duplicate edits. However, none of the logged instances resulted in duplicated edits so far (since the new text had changes other than whitespace in the edits): 2012-06-22 12:39:18 mw19 trwiki: Rev text trimmed for page 1469915 2012-06-22 13:06:08 mw35 trwiki: Rev text trimmed for page 970567 2012-06-22 13:45:02 srv288 enwiki: Rev text trimmed for page 26204397 2012-06-22 15:11:55 mw24 enwiki: Rev text trimmed for page 26204397 2012-06-22 15:40:52 mw24 enwiki: Rev text trimmed for page 26204397
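The trim mismatch fits in a few lines (a sketch, with Python's rstrip() standing in for PHP's rtrim()):

```python
def save_edit(stored_text, incoming_text):
    # Models the doEdit() flaw: change detection runs on the raw incoming
    # text, but what actually gets stored is the trimmed text.
    changed = incoming_text != stored_text     # strcmp() on untrimmed text
    new_stored = incoming_text.rstrip()        # Revision constructor rtrim()s
    return changed, new_stored

changed, new_stored = save_edit("foo", "foo\n")
# `changed` is True, yet new_stored == "foo": a new revision would be
# inserted whose stored text is byte-identical to the previous one.
```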
*** Bug 37829 has been marked as a duplicate of this bug. ***
(In reply to comment #32) > A current example from enwiki: > > Duplicate edits of [[Wilhering Abbey]] by [[User:193.171.131.230]]: > * 2012-06-21T12:51:53 UTC (+9 bytes) > * 2012-06-21T12:51:55 UTC (+19,493 bytes) = a saved null edit > > http://en.wikipedia.org/w/index.php?title=Wilhering_Abbey&curid=4739545&diff=498658206&oldid=498658203 Thanks. Do we have a list of instances that came up (like a user subpage or something)? I added some more logging and would like to see if it corresponds to new problems.
(In reply to comment #38) > Thanks. Do we have a list of instances that came up (like a user subpage or > something)? I added some more logging and would like to see if it corresponds > to new problems. Intermittent reports continue at http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Diff.2Fwatchlist_difficulties but I think most editors assume that you are already well aware of the problem so aren't bothering to itemise the many occurrences! More current examples from [[Special:RecentChanges]]: * [[2012 Arab Nations Cup]] 2012-06-22T18:40:58 and 18:41:06 UTC (null edit listed as +14,713bytes): http://en.wikipedia.org/w/index.php?title=2012_Arab_Nations_Cup&curid=27527671&diff=498870742&oldid=498870712 * [[Manchester United F.C.]] 2012-06-22T18:44:22 and 18:44:52 UTC (non-null edit listed as +91,463bytes): http://en.wikipedia.org/w/index.php?title=Manchester_United_F.C.&diff=498871316&oldid=498871227 The second example is unusual because the editor revised his/her own previous edit with a NON-null edit, so it does not superficially appear problematic. But the recentchanges log incorrectly shows a delta equal to the article size, even though the article size was unchanged, so this is presumably the same underlying bug. Have you identified any software changes that coincided with the initial flurry of reports on 2012-05-30?
None of those examples match the logging, so there is clearly a third independent cause of this bug. The first was fixed already, the second is awaiting review in gerrit (https://gerrit.wikimedia.org/r/#/c/12613/), and the third (and hopefully last) cause is still elusive. This is starting to feel like turning over a rock...
*** Bug 37734 has been marked as a duplicate of this bug. ***
*** Bug 37443 has been marked as a duplicate of this bug. ***
(In reply to comment #39) > Have you identified any software changes that coincided with the initial flurry > of reports on 2012-05-30? You can check http://wikitech.wikimedia.org/view/Server_admin_log, I didn't see anything too interesting there.
I've no idea which server log items might be relevant, but we could narrow down the timeframe. If you have not already done so, could you extract the relevant recentchanges entries (or a large sample) spanning the days around 2012-05-30? The table is purged after 30 days (at least on enwiki), so a database query now would be useful to preserve a comprehensive record for future analysis, even if your current focus is on logging within the current code. For example, the rough SQL below should identify most duplicate or edit-conflicting rows in recentchanges:
* including edit/create entries which have a prior byte length of zero or null
* but excluding possible page creates (where there is no recent prior edit/create)
* and excluding possible unblanked pages (where there exists any recent edit with a new byte length of zero, or there is no recent prior edit).
SELECT trc.* FROM recentchanges trc,
  ( SELECT rc_cur_id, MIN(rc_id) AS rc_id_min, MIN(COALESCE(rc_new_len,0))
    FROM recentchanges
    WHERE rc_new_len>0 AND rc_type<=1
    GROUP BY 1
    HAVING MIN(COALESCE(rc_new_len,0))>0 ) trcmin
WHERE (trc.rc_old_len=0 OR trc.rc_old_len IS NULL)
  AND trc.rc_type<=1
  AND trc.rc_cur_id=trcmin_rc_cur_id
  AND trc.rc_id>trcmin.rc_id_min
For simplicity, the above query ignores edits which do not have a valid older entry on recentchanges, and ignores all pages which have recently been blanked. Hope that's helpful; apologies for being presumptuous if not! With the 30-day purging time imminent, I thought it best to make a specific suggestion.
*** Bug 37919 has been marked as a duplicate of this bug. ***
Results of http://toolserver.org/~saper/bug37225/query.sh from plwiki_p are here: http://toolserver.org/~saper/bug37225/20120624.txt I suspect bug 37919 describes one specific case of this bug (maybe we need a new title here?)
(In reply to comment #44) > SELECT trc.* from recentchanges trc, ( SELECT rc_cur_id, MIN(rc_id) AS > rc_id_min, MIN(COALESCE(rc_new_len,0)) from recentchanges WHERE rc_new_len>0 > AND rc_type<=1 GROUP BY 1 HAVING MIN(COALESCE(rc_new_len,0))>0 ) trcmin WHERE > (trc.rc_old_len=0 OR trc.rc_old_len IS NULL) AND trc.rc_type<=1 AND > trc.rc_cur_id=trcmin_rc_cur_id AND trc.rc_id>trcmin.rc_id_min This query had a minor typo in it ("trcmin_rc_cur_id" --> "trcmin.rc_cur_id"). Here are the results from enwiki_p (warning: 8004 results): <http://p.defau.lt/?Z52AerNoGQ3hKM77IYNa5g>.
Thanks for running those queries and making the data available. Apologies for the typo. Also, I shouldn't have included "rc_new_len>0" within the WHERE condition of trcmin (it defeats the HAVING condition which should have excluded previously-blanked pages), so the above results contain "background noise" of a few spurious unblanking edits. For reference, the following SQL should exclude these cases, and is reformatted to exclude some irrelevant columns and revdeleted rows (simplifying where null handling is not needed):
--
SELECT trc.rc_timestamp, trc.rc_id, trc.rc_user, trc.rc_user_text,
       trc.rc_namespace, trc.rc_title, trc.rc_minor, trc.rc_bot,
       trc.rc_cur_id, trc.rc_this_oldid, trc.rc_last_oldid, trc.rc_new_len,
       trc.rc_deleted, trc.rc_logid, trc.rc_log_type, trc.rc_comment
FROM recentchanges trc,
  ( SELECT rc_cur_id, MIN(rc_id) AS rc_id_min, MIN(rc_new_len)
    FROM recentchanges
    WHERE rc_type<=1
    GROUP BY 1
    HAVING MIN(rc_new_len)>0 ) trcmin
WHERE trc.rc_old_len=0
  AND trc.rc_type<=1
  AND trc.rc_cur_id=trcmin.rc_cur_id
  AND trc.rc_id>trcmin.rc_id_min
  AND trc.rc_deleted=0
ORDER BY 1
--
But it's probably not necessary to re-run a query, because the onset timing is clear if we filter out all non-mainspace edits and obvious mainspace blanking reverts. Oddly, the rc_last_oldid values seem to be correct, even though rc_old_len is always incorrect.
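As a sanity check on the query's shape, a stripped-down version runs against a toy recentchanges table (SQLite standing in for MySQL; columns simplified):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE recentchanges (
    rc_id INTEGER PRIMARY KEY, rc_cur_id INTEGER, rc_type INTEGER,
    rc_old_len INTEGER, rc_new_len INTEGER)""")
db.executemany("INSERT INTO recentchanges VALUES (?,?,?,?,?)", [
    (1, 7, 0, 100, 111),  # ordinary +11 edit to page 7
    (2, 7, 0,   0, 111),  # the spurious entry: rc_old_len wrongly 0
])
rows = db.execute("""
    SELECT trc.rc_id FROM recentchanges trc,
      ( SELECT rc_cur_id, MIN(rc_id) AS rc_id_min FROM recentchanges
        WHERE rc_type<=1 GROUP BY rc_cur_id
        HAVING MIN(rc_new_len)>0 ) trcmin
    WHERE trc.rc_old_len=0 AND trc.rc_type<=1
      AND trc.rc_cur_id=trcmin.rc_cur_id AND trc.rc_id>trcmin.rc_id_min
""").fetchall()
# rows picks out only the duplicate entry (rc_id 2)
```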
The earliest enwiki mainspace delta discrepancy is at 2012-05-29T08:50:30 UTC: http://en.wikipedia.org/w/index.php?title=Impeachment_of_Renato_Corona&diff=494919562&oldid=494919551 The first 15 enwiki mainspace discrepancies are: * 2012-05-29T08:50:30 Impeachment_of_Renato_Corona [non-null delta-discrepancy edit] * 2012-05-29T08:50:39 Johannes_Brahms * 2012-05-29T08:50:41 Impeachment_of_Renato_Corona * 2012-05-29T08:50:43 Johannes_Brahms * 2012-05-29T08:50:54 Impeachment_of_Renato_Corona * 2012-05-29T08:51:05 Impeachment_of_Renato_Corona * 2012-05-29T09:41:51 Uzi * 2012-05-29T09:41:53 Tuba * 2012-05-29T09:41:57 Tuba * 2012-05-29T09:53:09 UEFA_Euro_2012_squads * 2012-05-29T10:08:57 Sydney_Opera_House * 2012-05-29T10:17:34 Nicole_Kidman * 2012-05-29T10:20:04 Confederation_of_African_Football * 2012-05-29T10:26:51 Diamond_Jubilee_of_Elizabeth_II * 2012-05-29T10:27:19 Diamond_Jubilee_of_Elizabeth_II The error rate seems to have stabilised at around 15 an hour, other than a significantly higher rate on 2012-06-20 between approximately 21:00 and 22:00 UTC. The earliest plwiki mainspace discrepancy is at 2012-05-30T06:35:26 UTC: http://pl.wikipedia.org/w/index.php?title=Ben_10:_Tajemnica_Omnitrixa&diff=31467920&oldid=31467918 The plwiki discrepancies recur at approximately 3-hour intervals during waking hours so there may be some significance to the 22 hour difference between enwiki onset and plwiki onset. Cross-checking against the server admin log, could the 2012-05-29 disk space error be relevant? (Perhaps it was not logged until a few hours after data corruption: "16:10: hashar: srv187 and srv188 are out of disk space".) Hope that's helpful - I don't know much about the innards of MediaWiki so I'm guessing what might be relevant!
1.20wmf4 was deployed on 2012-05-28, but for testwiki and mediawikiwiki only. enwiki waited until June 4. srv187 and srv188 were apaches, so they wouldn't affect page revisions. Perhaps the sync of php-1.20wmf4 the day before broke/uncached something? :/
Bumping down priority to "high". This problem is annoying, but it's not so important as to justify stopping all other dev work to fix this. Aaron continues to investigate this among his other tasks.
*** Bug 37965 has been marked as a duplicate of this bug. ***
*** Bug 38086 has been marked as a duplicate of this bug. ***
Getting more and more dupes for the same problem is very annoying. Please fix this as soon as possible. Thanks.
(In reply to comment #50) > Bumping down priority to "high". This problem is annoying, This bug is more than merely annoying, it's needlessly alarming to editors. Bug 38086 - a reported size change of "(+110,766)", when it should be "(0)" seemingly indicates a massive article change (blanking/vandalism/experiments). These usually require immediate action such as protection or blocking. If this bug persists, editors will begin to see all high deltas as false alarms, and begin ignoring them. This will allow real vandalism to persist longer. (In reply to comment #53) > Getting more and more dupes for the same problem is very annoying. More "dupes" are appearing because the title of this bug is misleading and does not appear when searching for "watchlist wrong size change" or "wrong delta".
(In reply to comment #54) > ... the title of this bug is misleading ... I see the title was nicely expanded. Thx.
I did some more live debug logging. I see things like:
2012-07-06 22:05:50 mw69 frwiki: Could not load rev 80557024 for Bargen (Berne) [208044].
2012-07-06 22:02:57 mw68 itwiki: Could not load rev 50844198 for Grandecourt [195339].
2012-07-06 21:28:18 mw39 enwiki: Could not load last rev for Dallas Mavericks [72880].
...which usually refer to duplicate edits. So the "$revision = Revision::newFromPageId( $this->getId(), $latest );" call in loadLastEdit() seems to return false sometimes, causing getRawText() to return false, causing the strcmp( $text, $oldtext ) to detect a change in text. This can lead to a duplicate revision, and an RC row with a change size of the whole text (since $oldsize is 0). I don't know why that happens, or why newFromPageId() would return false there. It falls back to the master if the revision is not found, so in theory it shouldn't be slave lag. But looking at the above logs (which give rev ID, page title, and page ID) and inspecting the pages, those revs clearly exist with those parameters, so there is no reason they shouldn't have been found. Normally, something like this points to a bug around slave lag. In any case, I may be able to stop this problem by just throwing exceptions when getRawText() returns false but getLatest() is not zero. I also see strange things like:
2012-07-06 21:49:15 srv216 enwiki: Could not find text with ID 0.
...that sounds like its own bug to look at. Sounds like something may be using Revision::newFromRow() without rev_text_id in the field list and then calling getText().
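The proposed stopgap ("throwing exceptions when getRawText() returns false but getLatest() is not zero") amounts to refusing to treat a failed revision load as an empty page. A rough sketch of the idea (fetch_rev is a hypothetical loader, not the MediaWiki API, and this is not the committed patch):

```python
def load_old_text(fetch_rev, latest_rev_id):
    # fetch_rev(rev_id) returns the revision text, or None when the load
    # fails. Without the check, a failed load looks like empty old text,
    # so $changed comes out true and a duplicate revision gets saved.
    text = fetch_rev(latest_rev_id)
    if text is None and latest_rev_id != 0:
        raise RuntimeError(
            "Could not load rev %d; aborting the save instead of "
            "recording a duplicate edit" % latest_rev_id)
    return text if text is not None else ""
```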
(In reply to comment #56)
> I did some more live debug logging.
But why is doEdit() called more than once? Per previous comments, this also happens to bot edits.
What happens when the master can't be reached (in Revision::newFromConds())? Any chance we could get a null back instead of an exception?
(In reply to comment #58)
> What happens when the master can't be reached (in Revision::newFromConds())?
> Any chance we could get a null back instead of an exception?
Normally no, unless the ignoreErrors() function was called on the DB. Running the sha1 script, though, I noticed that sometimes the DB class forgets that the connection was dropped, and I get raw mysql_query warnings. I think this happens when DB errors are caught and the DB is then used further down the line.
After some playing around with xmldumps-test, I am able to reliably reproduce the problem automatically. It's a race condition that can be triggered in the following way:
(1) Start with a page whose last revision is from PersonA.
(2) PersonB starts uploading a new revision of the page.
(3) PersonC starts uploading a new revision of the page.
(4) PersonC's request finishes.
(5) PersonB's request finishes.
Between (2) and (5), PersonB's connection to the master database executes the SQL sequence
[...]
SELECT /* User::loadGroups */ ug_group FROM `user_groups` WHERE ug_user = '1'
SELECT /* Block::newLoad Root */ ipb_id,ipb_address,ipb_by,ipb_by_text,ipb_reason,ipb_timestamp,ipb_auto,ipb_anon_only,ipb_create_account,ipb_enable_autoblock,ipb_expiry,ipb_deleted,ipb_block_email,ipb_allow_usertalk,ipb_parent_block_id FROM `ipblocks` WHERE ipb_address = 'Root'
SELECT /* WikiPage::pageData Root */ page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len FROM `page` WHERE page_namespace = '0' AND page_title = 'PageZ' LIMIT 1 FOR UPDATE
[...]
If PersonC's request is handled /between/ the first and third of these SELECTs, both revisions (PersonB's and PersonC's) end up in the database: the oldest version is PersonA's, followed by PersonC's, and finally PersonB's. In the "Recent Changes", PersonB's revision lists its full size as the + value. No database error, connection loss, network problem, etc. needs to occur. Yes, it's only a rough description. But as I will not find time to investigate the issue further in the next few days, I hope this description and SQL sequence nevertheless help others to quickly reproduce the race condition as well and hunt it down. As the setup is tedious, I attached some log files. ;-)
Created attachment 10836 [details] SQL logs taken, while the bug occurs
Note that an exception is now thrown per comment 56. This should not be occurring anymore. A more ideal solution is still in order.
Seems Aaron provided a hotfix while I prepared the logs for uploading in the previous comments :D But toward his call for a "more ideal solution"... it seems the culprit is the usage of "FOR UPDATE" in (only) some of the relevant queries. So some queries see values updated by other transactions, while other queries do not.
Assume we want to edit a page PageZ that is at revision 5. Both B and C are each about to submit a new revision.
B submits first, and opens a master (say BM) and a slave (say BS) connection to the database. Both connections are in a transaction. For both BM and BS, the page's page_latest is obviously 5, and the highest rev_id in the revision table for the page is obviously 5 as well.
/Before/ B's loadPageData executes the SQL of EditPage.php:1206 $this->mArticle->loadPageData( 'forupdate' );, C starts submitting her revision. So C also opens a master (say CM) and a slave (say CS) connection to the database. Both connections are in a transaction as well. For both CM and CS, the page's page_latest is obviously 5, and the highest rev_id in the revision table for the page is obviously 5 as well.
Now while B's request is handled slowly, C's request zooms right through. C's revision gets added (using rev_id 6) on CM for the page. For BM, BS, and CS, the page's page_latest is still 5, while for CM it is 6. Accordingly for the page's highest rev_id. C's request is done. The transactions CM and CS are committed.
Now B's request reaches the SQL for EditPage.php:1206 $this->mArticle->loadPageData( 'forupdate' );. This SQL statement (for connection BM) is
SELECT /* WikiPage::pageData Root */ page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len FROM `page` WHERE page_namespace = '0' AND page_title = 'PageZ' LIMIT 1 FOR UPDATE
Although the page's page_latest had been 5 before for BM, this query now yields a page_latest of 6, due to the "FOR UPDATE" suffix.
But recall that for both BM and BS the page's page_latest is still 5 without "FOR UPDATE", just as the highest rev_id in the revision table for the page is also still 5. So the subsequent Revision::fetchFromConds on BS (with "rev_id = '6'") fails, and the subsequent Revision::fetchFromConds on BM (with "rev_id = '6'") fails as well.

Should we:
- just stick with the hotfix of throwing an exception,
- restart our transaction to be able to fetch revisions that have been committed after our transaction started,
- add more "FOR UPDATE"s to be able to fetch revisions that have been committed after our transaction started,
- get rid of the "FOR UPDATE"s and detect an edit conflict, or
- something completely different?
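The isolation behavior described above can be sketched with a toy in-memory model of MySQL's REPEATABLE READ: plain SELECTs read from the transaction's snapshot, while SELECT ... FOR UPDATE is a locking read that sees the latest committed data. All class and key names here are illustrative; this is a sketch of the isolation semantics, not MediaWiki or MySQL code.

```python
# Toy model: under REPEATABLE READ, a transaction's plain reads come
# from a snapshot taken at transaction start, while a FOR UPDATE
# (locking) read bypasses the snapshot and sees committed data.

class Database:
    def __init__(self):
        self.committed = {"page_latest": 5}

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, db):
        self.db = db
        # Snapshot taken at transaction start (consistent read source).
        self.snapshot = dict(db.committed)

    def select(self, key):
        # Plain SELECT: consistent (snapshot) read.
        return self.snapshot[key]

    def select_for_update(self, key):
        # SELECT ... FOR UPDATE: locking read, sees latest committed data.
        return self.db.committed[key]

    def commit_update(self, key, value):
        # Commit a new value, visible to locking reads everywhere.
        self.db.committed[key] = value

db = Database()
bm = db.begin()                      # B's connection, snapshot at page_latest = 5
cm = db.begin()                      # C's connection
cm.commit_update("page_latest", 6)   # C saves revision 6 and commits

assert bm.select("page_latest") == 5             # plain read: stale snapshot
assert bm.select_for_update("page_latest") == 6  # FOR UPDATE: sees C's commit
# B now looks up rev_id = 6 with plain (snapshot) reads, which still
# see the pre-commit state -- so the fetch fails, as described above.
```

This mirrors why BM's FOR UPDATE query returns page_latest = 6 while the subsequent snapshot reads on BM and BS still cannot find revision 6.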
So first, my apologies for having caused this bug with Gerrit change #6079. I already mentioned MySQL's "consistent read" feature there, but I did not foresee this problem. I chose to use "FOR UPDATE" in EditPage.php to catch edit conflicts before executing WikiPage::doEdit(), but it seems that this is causing more harm than good. An easy solution is to remove the "FOR UPDATE" and rely on the already-present conflict detection code in WikiPage::updateRevisionOn(), since it puts a condition on the page_latest field, using the fetched value, when updating the database row. A value mismatch will then trigger a transaction rollback and an edit-conflict form.
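The conflict-detection approach described here is essentially a compare-and-set: the UPDATE carries the previously fetched page_latest in its WHERE clause, so a concurrent edit makes it match zero rows and the caller can roll back and show an edit-conflict form. A minimal sketch using SQLite (the table and column names mirror MediaWiki's schema, but the function itself is illustrative, not the actual WikiPage::updateRevisionOn() code):

```python
# Compare-and-set conflict detection: the UPDATE only succeeds if
# page_latest still holds the value we fetched earlier.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER)")
conn.execute("INSERT INTO page VALUES (1, 5)")

def update_revision_on(conn, page_id, expected_latest, new_latest):
    """Conditionally bump page_latest; returns False on an edit conflict."""
    cur = conn.execute(
        "UPDATE page SET page_latest = ? WHERE page_id = ? AND page_latest = ?",
        (new_latest, page_id, expected_latest),
    )
    return cur.rowcount == 1   # 0 rows touched => someone committed in between

# Editor B fetched page_latest = 5, but C committed revision 6 meanwhile.
conn.execute("UPDATE page SET page_latest = 6 WHERE page_id = 1")  # C's edit
assert update_revision_on(conn, 1, 5, 7) is False  # B detects the conflict
assert update_revision_on(conn, 1, 6, 7) is True   # retry with the fresh value
```

On a mismatch the real code would roll back the surrounding transaction and present the edit-conflict form rather than retry automatically.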
It seems the isolation between different edits or actions that create revisions has been removed, and this introduces another (maybe more severe) bug: if two such requests are sent simultaneously, request B may fail with

<?xml version="1.0"?><api servedby="mw71"><error code="internal_api_error_MWException" info="Exception Caught: Could not find text for current revision 21906884." xml:space="preserve" /></api>

where 21906884 is the rev_id of another edit by request A, which is incomplete when request B is processed.
That's the exception added by Aaron. Throwing such an exception is better than letting this bug occur.
I've fixed DATA_FOR_UPDATE loading in https://gerrit.wikimedia.org/r/#/c/16228/. I've replaced the 'forupdate' loading with 'fromdbmaster' in EditPage in https://gerrit.wikimedia.org/r/#/c/16229/.
Did we apply a fix retrospectively in the end?
Possibly related (it may cause premature commits, though I don't see how it could trigger this exact problem): I80faf2ed. To avoid the issue of premature commits and general transaction confusion, I suggest merging I8c0426e1 soon (along with the fixes I mentioned in the comments there).