Last modified: 2013-05-13 14:08:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T45516, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 43516 - MessageCache.php: Could not acquire lock
MessageCache.php: Could not acquire lock
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
1.21.x
All All
: High critical (vote)
: ---
Assigned To: Tim Starling
: platformeng
: 45335 (view as bug list)
Depends on:
Blocks: 39480
  Show dependency treegraph
 
Reported: 2012-12-30 11:08 UTC by Siebrand Mazeland
Modified: 2013-05-13 14:08 UTC (History)
21 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Siebrand Mazeland 2012-12-30 11:08:34 UTC
Exports of the translatewiki.net product ("repoexport ihris") iHRIS always exit with exceptions. I've run export.php commands for each individual group of the product, and encountered the below issue at least 10 times for 100 groups or so.

This issue has started to arise about a month ago. This is basically blocking exports of the product, as doing them manually and fully controlled isn't really an option.

Stack dump details:

2012-12-30 11:00:32 v22011026445052.yourvserver.net mediawiki-bw_: [3dfb4df5] [no req]   Exception from line 357 of /www/translatewiki.net/w/includes/cache/MessageCache.php: Could not acquire 'mediawiki-bw_:messages:en:status' lock.
#0 /www/translatewiki.net/w/includes/cache/MessageCache.php(707): MessageCache->load('en')
#1 /www/translatewiki.net/w/includes/cache/MessageCache.php(647): MessageCache->getMsgFromNamespace('Gadgets-definit...', 'en')
#2 /www/translatewiki.net/w/includes/Message.php(670): MessageCache->get('gadgets-definit...', true, Object(Language))
#3 /www/translatewiki.net/w/includes/Message.php(543): Message->fetchMessage()
#4 /www/translatewiki.net/w/extensions/Gadgets/Gadgets_body.php(580): Message->exists()
#5 /www/translatewiki.net/w/extensions/Gadgets/Gadgets_body.php(38): Gadget::loadStructuredList()
#6 [internal function]: GadgetHooks::userGetDefaultOptions(Array)
#7 /www/translatewiki.net/w/includes/Hooks.php(255): call_user_func_array('GadgetHooks::us...', Array)
#8 /www/translatewiki.net/w/includes/GlobalFunctions.php(3856): Hooks::run('UserGetDefaultO...', Array)
#9 /www/translatewiki.net/w/includes/User.php(1240): wfRunHooks('UserGetDefaultO...', Array)
#10 /www/translatewiki.net/w/includes/User.php(4155): User::getDefaultOptions()
#11 /www/translatewiki.net/w/includes/User.php(2220): User->loadOptions()
#12 /www/translatewiki.net/w/includes/context/RequestContext.php(284): User->getOption('language')
#13 /www/translatewiki.net/w/includes/StubObject.php(183): RequestContext->getLanguage()
#14 /www/translatewiki.net/w/includes/StubObject.php(128): StubUserLang->_newObject()
#15 /www/translatewiki.net/w/includes/StubObject.php(78): StubObject->_unstub('getCode', 5)
#16 /www/translatewiki.net/w/includes/StubObject.php(176): StubObject->_call('getCode', Array)
#17 /www/translatewiki.net/w/includes/cache/MessageCache.php(627): StubUserLang->__call('getCode', Array)
#18 /www/translatewiki.net/w/includes/cache/MessageCache.php(627): StubUserLang->getCode()
#19 /www/translatewiki.net/w/includes/Message.php(670): MessageCache->get('parentheses', true, Object(StubUserLang))
#20 /www/translatewiki.net/w/includes/Message.php(431): Message->fetchMessage()
#21 /www/translatewiki.net/w/includes/Message.php(534): Message->toString()
#22 /www/translatewiki.net/w/includes/specials/SpecialVersion.php(190): Message->escaped()
#23 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(533): SpecialVersion::getVersion()
#24 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(406): GettextFFS->getGenerator()
#25 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(357): GettextFFS->doGettextHeader(Object(MessageCollection), Array, false)
#26 /www/translatewiki.net/w/extensions/Translate/ffs/FFS.php(240): GettextFFS->writeReal(Object(MessageCollection))
#27 /www/translatewiki.net/w/extensions/Translate/scripts/export.php(242): SimpleFFS->write(Object(MessageCollection))
#28 {main}
Comment 1 Niklas Laxström 2013-02-25 05:45:09 UTC
*** Bug 45335 has been marked as a duplicate of this bug. ***
Comment 2 Niklas Laxström 2013-02-25 05:48:50 UTC
(No work on this bug since the reporting of two months ago), this exception is now often happening in WMF production.
Comment 3 Matthew Flaschen 2013-02-26 09:29:28 UTC
I've seen this happen on Wikimedia Labs (piramido.wmflabs.org and toro.wmflabs.org).  There was a thread about it at:

http://permalink.gmane.org/gmane.org.wikimedia.labs/756

Andrew Bogott hoped to fix it with https://gerrit.wikimedia.org/r/#/c/47680/, but that didn't work, at least not for toro.

There may be multiple bugs with the same symptoms, but I'm adding this here until pointed to another number.
Comment 4 Greg Grossmeier 2013-02-28 19:40:24 UTC
Setting to Critical as it is affecting WMF infrastructure as well now.
Comment 5 Matthew Flaschen 2013-03-01 00:56:10 UTC
This obviously needs to get fixed properly.

But since it's not mentioned above, this can often be worked around by restarting memcached.  e.g.

sudo /etc/init.d/memcached restart

It will recur again though.
Comment 6 Sam Reed (reedy) 2013-03-01 01:10:01 UTC
(In reply to comment #5)
> This obviously needs to get fixed properly.
> 
> But since it's not mentioned above, this can often be worked around by
> restarting memcached.  e.g.
> 
> sudo /etc/init.d/memcached restart
> 
> It will recur again though.

Let me know when you tell ops you've restarted memcached
Comment 7 Matthew Flaschen 2013-03-01 01:15:48 UTC
(In reply to comment #6)
> Let me know when you tell ops you've restarted memcached

I should clarify, that restart command is not intended for WMF production.

It is just a temporary workaround people can use on Labs or their third-party wikis (at their discretion) until the root cause is identified and fixed.
Comment 8 Platonides 2013-03-03 16:35:11 UTC
The problem here may be that when the export exits with an exception, locks are kept. While in bug 45335, it is more likely caused by a cache stampede.

I think this messages lock should be replaced with a PoolCounter. That should fix the issues.
Comment 9 Tim Starling 2013-03-20 06:33:27 UTC
(In reply to comment #8)
> The problem here may be that when the export exits with an exception, locks
> are kept.

If that's the case, then I3066d8db would fix it.
Comment 10 Matthew Flaschen 2013-03-26 21:59:17 UTC
On Wikimedia Labs, when restarting memcached as a workaround it generally gives an error like:

"Restarting memcached: start-stop-daemon: warning: failed to kill nnnnn: No such process"

where nnnnn is the PID from /var/run/memcached.pid

indicating that the process has somehow died.
Comment 11 Tim Starling 2013-04-03 01:05:25 UTC
This is not fixed, we are logging a lot of these.
Comment 12 Tim Starling 2013-04-03 06:48:09 UTC
This started with I811755d4, i.e. Niklas attempting to get someone to fix a bug by making it more severe. Since that change was deployed to WMF, about 523000 exception messages have been delivered to users. Meanwhile, the two people who collaborated on getting that change live (Niklas and Aaron) have not been monitoring the exception rate and have not fixed the underlying issue. Apparently it's now my problem, which is fine, it just would have made my job a bit easier if the people who deliberately introduced this bug could have told me that they did so.

The message cache expiry is one day. When I tested it at an off-peak time (05:24), it took 3.5 seconds to rebuild the commonswiki/ru cache which was previously the subject of an exception flood. The lock wait loop gives up after 10 seconds, and after those 10 seconds, an exception is thrown as long as it has been less than 60 seconds since the lock acquisition. So, what could the problem possibly be? I'll ask my local rocket scientist.
Comment 13 Niklas Laxström 2013-04-03 10:52:32 UTC
What you wrote is true, except that nobody told me to monitor nor how to monitor exception rates. Very little information flows to me what is available on wmf production and what is expected.

Sorry, I don't understand the purpose of your second paragraph. It's probably a good thing that I'm not building rockets.
Comment 14 Tim Starling 2013-04-03 11:15:52 UTC
Proposed fix: Ia145fd90
Comment 15 Ori Livneh 2013-04-04 00:35:08 UTC
(In reply to comment #13)
> What you wrote is true, except that nobody told me to monitor nor how to
> monitor exception rates.

You can SSH to fluorine.eqiad.wmnet and monitor /a/mw-log/exception.log and /a/mw-log/fatal.log. It is a little obscure.
Comment 16 Tim Starling 2013-04-05 05:27:53 UTC
In the old code, all threads finding the cache empty will wait for the lock, taking turns to run loadFromDB(). They don't recheck the shared cache after the lock is acquired. So if it loadFromDB() takes 1s, and there are 100 req/s, then there will be 99 threads queued up waiting for the lock when the first one finishes. Then each will run loadFromDB() in turn, for another 9 seconds. Then the lock wait timeout will be reached, so the other 90 threads will fail to add the status key and will hit the "Could not acquire $statusKey" exception simultaneously. 

This theory agrees pretty well with the pattern of exception floods we see in the logs.
Comment 17 Tim Starling 2013-04-11 03:50:06 UTC
(In reply to comment #13)
> What you wrote is true, except that nobody told me to monitor nor how to
> monitor exception rates. Very little information flows to me what is
> available on wmf production and what is expected.

I didn't expect you to know those things. I expected Aaron to know them.
Comment 18 Andre Klapper 2013-04-11 10:50:08 UTC
(In reply to comment #14)
> Proposed fix: Ia145fd90

Patch got merged yesterday.
Comment 19 Andre Klapper 2013-04-18 17:41:29 UTC
Is it already possible to stay if the patch was effective / can this ticket be closed or is there something left to do?
Comment 20 Siebrand Mazeland 2013-04-18 19:01:38 UTC
I have planned an activity on Saturday that usually triggered this bug repeatedly. I hope I remember to update. In fact, I'll add it to my calendar...
Comment 21 Steven Walling 2013-04-18 19:04:01 UTC
(In reply to comment #19)
> Is it already possible to stay if the patch was effective / can this ticket
> be
> closed or is there something left to do?

I am not seeing these errors on our Labs instances anymore, FWIW.
Comment 22 Siebrand Mazeland 2013-04-21 12:55:59 UTC
Don't see these errors anymore on translatewiki.net where I usually saw them. Thanks for the fix.
Comment 23 Mark Holmquist 2013-04-23 22:54:13 UTC
We're getting a similar error in response to API requests in our Parsoid RT testing infrastructure. Maybe reopen?

http://parsoid.wmflabs.org:8001/result/d74bd4e8003084de3760ad64e006f426e588a19f/ar/Barbie
Comment 24 Mark Holmquist 2013-04-24 14:59:04 UTC
Yeah, reopen.
Comment 25 Andre Klapper 2013-05-06 15:41:23 UTC
Mark: How often does this happen?

Tim: Do you plan to take another look at this now that this ticket got reopened, or shall this be unassigned from you?
Comment 26 Andre Klapper 2013-05-13 10:39:26 UTC
mtraceur: How often does this happen?

Tim: Do you plan to take another look at this now that this ticket got
reopened, or shall this be unassigned from you?
Comment 27 Mark Holmquist 2013-05-13 14:08:25 UTC
Consistently, i.e., I couldn't get it to not happen when testing certain pages.

It looks like that specific page is probably working now, let me see if we have any other instances where there are issues...

No, I guess Parsoid is getting no errors in its RT testing, so I'll close this again.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links