Last modified: 2013-05-13 14:08:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T45516, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 43516 - MessageCache.php: Could not acquire lock


Summary:	MessageCache.php: Could not acquire lock

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	General/Unknown (Other open bugs)
Version:	1.21.x
Hardware:	All All

Importance:	High critical (vote)
Target Milestone:	---
Assigned To:	Tim Starling

URL:
Whiteboard:
Keywords:	platformeng

Duplicates:	45335 (view as bug list)
Depends on:
Blocks:	39480
	Show dependency tree / graph

Reported:	2012-12-30 11:08 UTC by Siebrand Mazeland
Modified:	2013-05-13 14:08 UTC (History)
CC List:	21 users (show)

See Also:	5092
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Siebrand Mazeland 2012-12-30 11:08:34 UTC

Exports of the translatewiki.net product ("repoexport ihris") iHRIS always exit with exceptions. I've run export.php commands for each individual group of the product, and encountered the below issue at least 10 times for 100 groups or so.

This issue has started to arise about a month ago. This is basically blocking exports of the product, as doing them manually and fully controlled isn't really an option.

Stack dump details:

2012-12-30 11:00:32 v22011026445052.yourvserver.net mediawiki-bw_: [3dfb4df5] [no req]   Exception from line 357 of /www/translatewiki.net/w/includes/cache/MessageCache.php: Could not acquire 'mediawiki-bw_:messages:en:status' lock.
#0 /www/translatewiki.net/w/includes/cache/MessageCache.php(707): MessageCache->load('en')
#1 /www/translatewiki.net/w/includes/cache/MessageCache.php(647): MessageCache->getMsgFromNamespace('Gadgets-definit...', 'en')
#2 /www/translatewiki.net/w/includes/Message.php(670): MessageCache->get('gadgets-definit...', true, Object(Language))
#3 /www/translatewiki.net/w/includes/Message.php(543): Message->fetchMessage()
#4 /www/translatewiki.net/w/extensions/Gadgets/Gadgets_body.php(580): Message->exists()
#5 /www/translatewiki.net/w/extensions/Gadgets/Gadgets_body.php(38): Gadget::loadStructuredList()
#6 [internal function]: GadgetHooks::userGetDefaultOptions(Array)
#7 /www/translatewiki.net/w/includes/Hooks.php(255): call_user_func_array('GadgetHooks::us...', Array)
#8 /www/translatewiki.net/w/includes/GlobalFunctions.php(3856): Hooks::run('UserGetDefaultO...', Array)
#9 /www/translatewiki.net/w/includes/User.php(1240): wfRunHooks('UserGetDefaultO...', Array)
#10 /www/translatewiki.net/w/includes/User.php(4155): User::getDefaultOptions()
#11 /www/translatewiki.net/w/includes/User.php(2220): User->loadOptions()
#12 /www/translatewiki.net/w/includes/context/RequestContext.php(284): User->getOption('language')
#13 /www/translatewiki.net/w/includes/StubObject.php(183): RequestContext->getLanguage()
#14 /www/translatewiki.net/w/includes/StubObject.php(128): StubUserLang->_newObject()
#15 /www/translatewiki.net/w/includes/StubObject.php(78): StubObject->_unstub('getCode', 5)
#16 /www/translatewiki.net/w/includes/StubObject.php(176): StubObject->_call('getCode', Array)
#17 /www/translatewiki.net/w/includes/cache/MessageCache.php(627): StubUserLang->__call('getCode', Array)
#18 /www/translatewiki.net/w/includes/cache/MessageCache.php(627): StubUserLang->getCode()
#19 /www/translatewiki.net/w/includes/Message.php(670): MessageCache->get('parentheses', true, Object(StubUserLang))
#20 /www/translatewiki.net/w/includes/Message.php(431): Message->fetchMessage()
#21 /www/translatewiki.net/w/includes/Message.php(534): Message->toString()
#22 /www/translatewiki.net/w/includes/specials/SpecialVersion.php(190): Message->escaped()
#23 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(533): SpecialVersion::getVersion()
#24 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(406): GettextFFS->getGenerator()
#25 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(357): GettextFFS->doGettextHeader(Object(MessageCollection), Array, false)
#26 /www/translatewiki.net/w/extensions/Translate/ffs/FFS.php(240): GettextFFS->writeReal(Object(MessageCollection))
#27 /www/translatewiki.net/w/extensions/Translate/scripts/export.php(242): SimpleFFS->write(Object(MessageCollection))
#28 {main}

Comment 1 Niklas Laxström 2013-02-25 05:45:09 UTC

*** Bug 45335 has been marked as a duplicate of this bug. ***

Comment 2 Niklas Laxström 2013-02-25 05:48:50 UTC

(No work on this bug since the reporting of two months ago), this exception is now often happening in WMF production.

Comment 3 Matthew Flaschen 2013-02-26 09:29:28 UTC

I've seen this happen on Wikimedia Labs (piramido.wmflabs.org and toro.wmflabs.org).  There was a thread about it at:

http://permalink.gmane.org/gmane.org.wikimedia.labs/756

Andrew Bogott hoped to fix it with https://gerrit.wikimedia.org/r/#/c/47680/, but that didn't work, at least not for toro.

There may be multiple bugs with the same symptoms, but I'm adding this here until pointed to another number.

Comment 4 Greg Grossmeier 2013-02-28 19:40:24 UTC

Setting to Critical as it is affecting WMF infrastructure as well now.

Comment 5 Matthew Flaschen 2013-03-01 00:56:10 UTC

This obviously needs to get fixed properly.

But since it's not mentioned above, this can often be worked around by restarting memcached.  e.g.

sudo /etc/init.d/memcached restart

It will recur again though.

Comment 6 Sam Reed (reedy) 2013-03-01 01:10:01 UTC

(In reply to comment #5)
> This obviously needs to get fixed properly.
> 
> But since it's not mentioned above, this can often be worked around by
> restarting memcached.  e.g.
> 
> sudo /etc/init.d/memcached restart
> 
> It will recur again though.

Let me know when you tell ops you've restarted memcached

Comment 7 Matthew Flaschen 2013-03-01 01:15:48 UTC

(In reply to comment #6)
> Let me know when you tell ops you've restarted memcached

I should clarify, that restart command is not intended for WMF production.

It is just a temporary workaround people can use on Labs or their third-party wikis (at their discretion) until the root cause is identified and fixed.

Comment 8 Platonides 2013-03-03 16:35:11 UTC

The problem here may be that when the export exits with an exception, locks are kept. While in bug 45335, it is more likely caused by a cache stampede.

I think this messages lock should be replaced with a PoolCounter. That should fix the issues.

Comment 9 Tim Starling 2013-03-20 06:33:27 UTC

(In reply to comment #8)
> The problem here may be that when the export exits with an exception, locks
> are kept.

If that's the case, then I3066d8db would fix it.

Comment 10 Matthew Flaschen 2013-03-26 21:59:17 UTC

On Wikimedia Labs, when restarting memcached as a workaround it generally gives an error like:

"Restarting memcached: start-stop-daemon: warning: failed to kill nnnnn: No such process"

where nnnnn is the PID from /var/run/memcached.pid

indicating that the process has somehow died.

Comment 11 Tim Starling 2013-04-03 01:05:25 UTC

This is not fixed, we are logging a lot of these.

Comment 12 Tim Starling 2013-04-03 06:48:09 UTC

This started with I811755d4, i.e. Niklas attempting to get someone to fix a bug by making it more severe. Since that change was deployed to WMF, about 523000 exception messages have been delivered to users. Meanwhile, the two people who collaborated on getting that change live (Niklas and Aaron) have not been monitoring the exception rate and have not fixed the underlying issue. Apparently it's now my problem, which is fine, it just would have made my job a bit easier if the people who deliberately introduced this bug could have told me that they did so.

The message cache expiry is one day. When I tested it at an off-peak time (05:24), it took 3.5 seconds to rebuild the commonswiki/ru cache which was previously the subject of an exception flood. The lock wait loop gives up after 10 seconds, and after those 10 seconds, an exception is thrown as long as it has been less than 60 seconds since the lock acquisition. So, what could the problem possibly be? I'll ask my local rocket scientist.

Comment 13 Niklas Laxström 2013-04-03 10:52:32 UTC

What you wrote is true, except that nobody told me to monitor nor how to monitor exception rates. Very little information flows to me what is available on wmf production and what is expected.

Sorry, I don't understand the purpose of your second paragraph. It's probably a good thing that I'm not building rockets.

Comment 14 Tim Starling 2013-04-03 11:15:52 UTC

Proposed fix: Ia145fd90

Comment 15 Ori Livneh 2013-04-04 00:35:08 UTC

(In reply to comment #13)
> What you wrote is true, except that nobody told me to monitor nor how to
> monitor exception rates.

You can SSH to fluorine.eqiad.wmnet and monitor /a/mw-log/exception.log and /a/mw-log/fatal.log. It is a little obscure.

Comment 16 Tim Starling 2013-04-05 05:27:53 UTC

In the old code, all threads finding the cache empty will wait for the lock, taking turns to run loadFromDB(). They don't recheck the shared cache after the lock is acquired. So if it loadFromDB() takes 1s, and there are 100 req/s, then there will be 99 threads queued up waiting for the lock when the first one finishes. Then each will run loadFromDB() in turn, for another 9 seconds. Then the lock wait timeout will be reached, so the other 90 threads will fail to add the status key and will hit the "Could not acquire $statusKey" exception simultaneously. 

This theory agrees pretty well with the pattern of exception floods we see in the logs.

Comment 17 Tim Starling 2013-04-11 03:50:06 UTC

(In reply to comment #13)
> What you wrote is true, except that nobody told me to monitor nor how to
> monitor exception rates. Very little information flows to me what is
> available on wmf production and what is expected.

I didn't expect you to know those things. I expected Aaron to know them.

Comment 18 Andre Klapper 2013-04-11 10:50:08 UTC

(In reply to comment #14)
> Proposed fix: Ia145fd90

Patch got merged yesterday.

Comment 19 Andre Klapper 2013-04-18 17:41:29 UTC

Is it already possible to stay if the patch was effective / can this ticket be closed or is there something left to do?

Comment 20 Siebrand Mazeland 2013-04-18 19:01:38 UTC

I have planned an activity on Saturday that usually triggered this bug repeatedly. I hope I remember to update. In fact, I'll add it to my calendar...

Comment 21 Steven Walling 2013-04-18 19:04:01 UTC

(In reply to comment #19)
> Is it already possible to stay if the patch was effective / can this ticket
> be
> closed or is there something left to do?

I am not seeing these errors on our Labs instances anymore, FWIW.

Comment 22 Siebrand Mazeland 2013-04-21 12:55:59 UTC

Don't see these errors anymore on translatewiki.net where I usually saw them. Thanks for the fix.

Comment 23 Mark Holmquist 2013-04-23 22:54:13 UTC

We're getting a similar error in response to API requests in our Parsoid RT testing infrastructure. Maybe reopen?

http://parsoid.wmflabs.org:8001/result/d74bd4e8003084de3760ad64e006f426e588a19f/ar/Barbie

Comment 24 Mark Holmquist 2013-04-24 14:59:04 UTC

Yeah, reopen.

Comment 25 Andre Klapper 2013-05-06 15:41:23 UTC

Mark: How often does this happen?

Tim: Do you plan to take another look at this now that this ticket got reopened, or shall this be unassigned from you?

Comment 26 Andre Klapper 2013-05-13 10:39:26 UTC

mtraceur: How often does this happen?

Tim: Do you plan to take another look at this now that this ticket got
reopened, or shall this be unassigned from you?

Comment 27 Mark Holmquist 2013-05-13 14:08:25 UTC

Consistently, i.e., I couldn't get it to not happen when testing certain pages.

It looks like that specific page is probably working now, let me see if we have any other instances where there are issues...

No, I guess Parsoid is getting no errors in its RT testing, so I'll close this again.

Note You need to log in before you can comment on or make changes to this bug.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links

abogott
afeldman
aklapper
asharma
b
bugzilla+org.wikimedia
cananian
federicoleva
greg
he7d3r+bugs
hoo
mflaschen
mtraceur
niklas.laxstrom
ori
Platonides
robla
siebrand
srik.lak
swalling
wikibugs-l