Last modified: 2013-04-23 17:17:47 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T29320, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27320 - MessageBlobStore::clear() causes scaling problems on multi-server setups with CDB l10ncache
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: ResourceLoader
Version: 1.18.x
Hardware: All All
Importance: Immediate critical
Target Milestone: ---
Assigned To: Brad Jorsch
Keywords: performance
Depends on:
Blocks:
Reported: 2011-02-11 08:10 UTC by Roan Kattouw
Modified: 2013-04-23 17:17 UTC (History)
CC List: 13 users

Description Roan Kattouw 2011-02-11 08:10:37 UTC
We had to disable MessageBlobStore::clear() on WMF and replace it with a maintenance script that runs upon sync. On multi-server setups where l10ncache is in CDB, LocalisationCache::recache() runs once per server per language, so the MessageBlobStore (MBS) was being cleared many times per update. This led to DB deadlocks and possibly to other performance issues.

I guess the least we can do is offer a $wg variable to disable clear(). A better solution, suggested by Tim, would be to add CacheDependency::getModifiedTime(), add a way to retrieve the maximum mtime from LocalisationCache, and use that in the startup module to conditionally call MessageBlobStore::clear() before retrieving any module timestamps. This would scale because the startup module is cached for 5 minutes.
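
[Editorial sketch] The conditional-clear idea suggested above can be illustrated roughly as follows. This is a minimal Python model, not MediaWiki code: BlobStore, max_l10n_mtime and startup_check are hypothetical names invented for the illustration, and the mtimes are plain numbers rather than real cache file timestamps.

```python
from dataclasses import dataclass, field


@dataclass
class BlobStore:
    """Hypothetical stand-in for MessageBlobStore; not the MediaWiki API."""
    blobs: dict = field(default_factory=dict)
    built_from_mtime: float = 0.0  # mtime of the l10n data the blobs reflect
    clear_count: int = 0

    def clear(self, as_of):
        self.blobs.clear()
        self.built_from_mtime = as_of
        self.clear_count += 1


def max_l10n_mtime(language_mtimes):
    """Newest modification time across all language caches (0.0 if none)."""
    return max(language_mtimes.values(), default=0.0)


def startup_check(store, language_mtimes):
    """Clear the store only if some language cache is newer than the blobs."""
    newest = max_l10n_mtime(language_mtimes)
    if newest > store.built_from_mtime:
        store.clear(as_of=newest)
        return True
    return False


store = BlobStore(blobs={("ext.example", "en"): "{}"})
mtimes = {"en": 100.0, "de": 250.0, "fr": 180.0}
first = startup_check(store, mtimes)   # blobs are older than the caches: clear
second = startup_check(store, mtimes)  # nothing changed since: no clear
```

In this model, rebuilding every language cache triggers at most one clear at the next check, instead of one clear per server per language.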
Comment 1 Tim Starling 2013-04-10 02:19:42 UTC
The maintenance script run upon sync (clearMessageBlobs.php) is regularly causing s1 master overload, for a few minutes after each execution. Increasing priority.
Comment 2 Andre Klapper 2013-04-10 02:50:46 UTC
Related thread on ops mailing list from 05 Apr 2013:
"outage this evening - possible localization updates issue"

Roan: Do you plan to look into this, since it's assigned to you (probably by default at the time)? If not, who could?
Comment 3 Erik Moeller 2013-04-10 07:45:00 UTC
Tim/Roan, do we need to disable localisation updates until this is resolved?
Comment 5 Rob Lanphier 2013-04-10 19:38:29 UTC
Result of conversation on #wikimedia-tech: Brad is going to take a crack at fixing this tomorrow. In the meantime, we would like to disable updates so that we're not taking down the site daily.
Comment 6 Gerrit Notification Bot 2013-04-10 20:57:02 UTC
Related URL: https://gerrit.wikimedia.org/r/58604 (Gerrit Change I7ed047a3802c7186eb0c040556022e58b266a2be)
Comment 7 Gerrit Notification Bot 2013-04-11 05:07:38 UTC
Related URL: https://gerrit.wikimedia.org/r/58660 (Gerrit Change I50d366a03af649bc87158dde4516aae1a2c24924)
Comment 8 Gerrit Notification Bot 2013-04-12 17:09:43 UTC
Related URL: https://gerrit.wikimedia.org/r/58909 (Gerrit Change I3b6ae12875f2f323210fdfba36c5c5d9183588e2)
Comment 9 Gerrit Notification Bot 2013-04-12 17:09:47 UTC
Related URL: https://gerrit.wikimedia.org/r/58909 (Gerrit Change I3b6ae12875f2f323210fdfba36c5c5d9183588e2)
Comment 10 Gerrit Notification Bot 2013-04-12 17:10:12 UTC
Related URL: https://gerrit.wikimedia.org/r/58910 (Gerrit Change I72a1557f9c18b845c952dd2e2697d92e8eb71d93)
Comment 11 Gerrit Notification Bot 2013-04-12 17:11:03 UTC
Related URL: https://gerrit.wikimedia.org/r/58911 (Gerrit Change Ic633a7fde8d4a1d9e9326aa5ae52bf1227e8d30f)
Comment 12 Rob Lanphier 2013-04-12 17:31:04 UTC
Brad has a proposed fix for this with Gerrit change #58911. We plan to leave localization updates disabled over the weekend, giving Tim time to review this on his Monday. If Tim thinks this is worth a shot and is up for deploying it, he can do that. Otherwise, we'll figure out what to do on Monday in the U.S.
Comment 13 Brad Jorsch 2013-04-12 19:03:44 UTC
There are three patches in Gerrit related to this.

Gerrit change #58909 adds a new script to the WikimediaMaintenance extension. This new script updates the RL message cache directly, instead of wiping it out and relying on client requests to repopulate it.
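
[Editorial sketch] The difference between wiping the cache and updating it directly can be shown with a small Python model. This is not the WikimediaMaintenance script: build_blob and update_blobs are hypothetical helpers, and the store is a plain dict keyed by (module, language) rather than a database table.

```python
import json


def build_blob(message_keys, messages):
    """Serialize the messages a module needs, in a stable key order."""
    return json.dumps({k: messages.get(k, "") for k in sorted(message_keys)})


def update_blobs(store, module_keys, messages_by_lang):
    """Rewrite stale blobs in place; return the (module, lang) pairs changed."""
    changed = []
    for (module, lang), old_blob in list(store.items()):
        new_blob = build_blob(module_keys[module], messages_by_lang[lang])
        if new_blob != old_blob:
            store[(module, lang)] = new_blob
            changed.append((module, lang))
    return changed


module_keys = {"ext.example": ["save", "cancel"]}
old_messages = {
    "en": {"save": "Save", "cancel": "Cancel"},
    "de": {"save": "Speichern", "cancel": "Abbrechen"},
}
store = {
    ("ext.example", lang): build_blob(module_keys["ext.example"], msgs)
    for lang, msgs in old_messages.items()
}

# Only the German "save" message changes in this update.
new_messages = {
    "en": {"save": "Save", "cancel": "Cancel"},
    "de": {"save": "Sichern", "cancel": "Abbrechen"},
}
changed = update_blobs(store, module_keys, new_messages)
```

Because every entry stays present throughout, no client request has to rebuild a missing blob, and only entries whose content actually differs are written.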

Gerrit change #58910 adjusts l10nupdate to preserve the timestamps on the l10n cdb files from LocalisationUpdate when copying them into position. This should improve the efficiency of the new script, since it will allow it to skip updating messages in languages that haven't actually changed.
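
[Editorial sketch] Why preserved timestamps help can be seen in a short Python example. It assumes a hypothetical layout with one "<lang>.cdb" file per language and a hypothetical languages_to_update helper; the real l10nupdate script and cache file names may differ. shutil.copy2 is used because it copies file metadata, including the modification time, along with the contents.

```python
import os
import shutil
import tempfile


def languages_to_update(cache_dir, last_run):
    """Return language codes whose cache file was modified after last_run."""
    changed = []
    for name in sorted(os.listdir(cache_dir)):
        if name.endswith(".cdb"):
            path = os.path.join(cache_dir, name)
            if os.path.getmtime(path) > last_run:
                changed.append(name[: -len(".cdb")])
    return changed


def demo():
    """Copy two fake cache files with copy2 and list the languages to refresh."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        # Fake source files: "en" last changed at t=1000, "de" at t=2000.
        for lang, mtime in (("en", 1000), ("de", 2000)):
            path = os.path.join(src, lang + ".cdb")
            with open(path, "wb") as handle:
                handle.write(b"placeholder")
            os.utime(path, (mtime, mtime))
        # copy2 keeps the source mtime on the copied file.
        for name in os.listdir(src):
            shutil.copy2(os.path.join(src, name), os.path.join(dst, name))
        # The previous update ran at t=1500, so only "de" needs work.
        return languages_to_update(dst, last_run=1500)


result = demo()
```

If the files were copied without preserving metadata, every copy would carry the time of the copy itself, and every language would look changed on each run.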

Gerrit change #58911 changes l10nupdate to actually call the new script. Note that 58909 must be deployed to all wikis (so likely 1.22wmf1 and 1.22wmf2) before this is deployed.
Comment 14 Gerrit Notification Bot 2013-04-15 17:29:11 UTC
https://gerrit.wikimedia.org/r/58909 (Gerrit Change I3b6ae12875f2f323210fdfba36c5c5d9183588e2) | change APPROVED and MERGED [by Aaron Schulz]
Comment 15 Andre Klapper 2013-04-16 22:38:34 UTC
All three patches have been merged.
Comment 16 Brad Jorsch 2013-04-16 22:42:48 UTC
Now we're waiting to see how the new code does.
Comment 17 Andre Klapper 2013-04-22 15:38:19 UTC
(In reply to comment #16)
> Now we're waiting to see how the new code does.

anomie: So can we say already if it works (/close this ticket)?  :)
Comment 18 Brad Jorsch 2013-04-22 16:38:23 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > Now we're waiting to see how the new code does.
> 
> anomie: So can we say already if it works (/close this ticket)?  :)

So far, so good. Before the RL cache purge was disabled on April 11, a hallmark of the problem was icinga notifications about all of the apaches being down around the time of the update run. The #wikimedia-operations logs since April 17 aren't showing those notifications.

I'm inclined to be cautious and wait until the 24th, making it a full week with no issues, but if someone wants to close before then I wouldn't complain.
Comment 19 Rob Lanphier 2013-04-23 17:17:47 UTC
/me takes on Brad's offer to close this.  :-)
