Last modified: 2013-04-22 16:14:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T45448, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 43448 - varnishhtcpd occasionally stops responding to HTCP requests
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: ops
Depends on:
Blocks: 41130
Reported: 2012-12-26 23:50 UTC by Rob Lanphier
Modified: 2013-04-22 16:14 UTC


Description Rob Lanphier 2012-12-26 23:50:03 UTC
Sometimes, varnishhtcpd will stop responding to HTCP requests due to unexplained thread corruption issues.  In a recent example, the daemon was logging "Can't call method "accept" on an undefined value at /usr/local/bin/varnishhtcpd line 71" on purge requests.  varnishhtcpd spawns worker threads, and apparently, sometimes the workers go on strike, and that's what the picket signs say.  ;-)  

A workaround that Asher proposes is to modify the daemon to kill itself when it gets into that state, which should cause upstart to respawn it.
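
To illustrate the proposed workaround, here is a minimal sketch of a self-terminating watchdog. It is written in Python for illustration only and is not the varnishhtcpd code; the class name `WorkerWatchdog`, the `timeout_s` threshold, and the heartbeat scheme are all assumptions. The idea is that each worker reports a heartbeat, and if any worker goes silent for too long the process exits with a non-zero status so that a supervisor such as upstart can restart it.

```python
import os
import threading
import time


class WorkerWatchdog:
    """Tracks worker heartbeats and hard-exits the process if a worker stalls.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, timeout_s=30.0, exit_fn=os._exit, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed threshold, tune per deployment
        self._exit_fn = exit_fn      # injectable so the logic can be tested
        self._clock = clock
        self._last_beat = {}
        self._lock = threading.Lock()

    def beat(self, worker_id):
        """Called by a worker on every loop iteration, including idle ones."""
        with self._lock:
            self._last_beat[worker_id] = self._clock()

    def stalled_workers(self):
        """Return the sorted ids of workers whose last heartbeat is too old."""
        now = self._clock()
        with self._lock:
            return sorted(
                worker_id
                for worker_id, last in self._last_beat.items()
                if now - last > self.timeout_s
            )

    def check(self):
        """Exit with status 1 if any worker has stopped reporting."""
        stalled = self.stalled_workers()
        if stalled:
            # os._exit terminates immediately and does not wait for other
            # threads, so a wedged worker cannot block the exit.
            self._exit_fn(1)
        return stalled
```

In this sketch the main loop would call `check()` periodically. The default `os._exit` is chosen deliberately: an ordinary interpreter shutdown waits for non-daemon threads, which is exactly what a stuck worker would prevent.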

This problem was discovered in fixing the HTCP issues documented in the comments on bug 41130 (late December 2012 comments).
Comment 1 Bawolff (Brian Wolff) 2012-12-27 00:02:02 UTC
[Sorry if slightly off-topic] Could we have some sort of monitoring of whether things actually get purged? Squid/Varnish purging suddenly not working seems to have happened quite a few times in the past (each time for a different reason), and we have no monitoring of it. (As far as I am aware, we don't even have any unit tests for it on the MW side.)

This is bad since:
*Most people don't have squid/varnish set up in their dev environment, so people don't notice on local test wikis.
*The symptoms are gradual, and usually pass unnoticed for some time.
*With the exception of images, the people primarily affected are anons, who are less likely to know how to report the issue.
Comment 2 Bawolff (Brian Wolff) 2012-12-27 00:25:52 UTC
(In reply to comment #1)
Moved "we should monitor HTCP purging" to a separate bug: bug 43449
Comment 3 Andre Klapper 2013-01-22 13:29:26 UTC
FYI, more info posted on ops@ by Tim Starling ~6 hours ago:

varnishhtcpd daemon (listens on port 4827 for HTCP purges, and converts them to HTTP purges on localhost) deadlocked and stopped working on all upload hosts. 
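
The HTTP half of that conversion can be sketched as follows. This is a Python illustration under stated assumptions, not the daemon's actual code: the function names are hypothetical, the local cache is assumed to listen on 127.0.0.1 port 80, and the cache is assumed to be configured to accept the `PURGE` method. Parsing the incoming HTCP packet is left out.

```python
import socket
from urllib.parse import urlsplit


def build_purge_request(url):
    """Return the raw bytes of an HTTP PURGE request for an absolute URL.

    Assumes the URL is already percent-encoded (ASCII only).
    """
    parts = urlsplit(url)
    if not parts.netloc:
        raise ValueError(f"URL has no host: {url!r}")
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"PURGE {path} HTTP/1.1",
        f"Host: {parts.netloc}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_purge(url, host="127.0.0.1", port=80, timeout_s=2.0):
    """Send the PURGE to the local cache and return its status line.

    host and port are assumptions about where the cache listens.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(build_purge_request(url))
        reply = sock.makefile("rb").readline()
    return reply.decode("latin-1").strip()
```

The purged URL's host goes into the `Host` header while the connection itself goes to localhost, so the cache can match the object it stored for that hostname.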

Details: "Apparently the worker threads deadlock each other in malloc/realloc. Then the queue overflows and the main thread tries to exit. The main thread closes its HTCP listen socket and then joins in with the deadlock. So it never exits and upstart can't respawn it."
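
One way to avoid the "main thread joins the deadlock" outcome described here is to make the overflow path exit without any cleanup that could block. The sketch below is a Python illustration only; the class name `PurgeDispatcher` and the `max_pending` limit are assumptions, and it is not claimed to be what the eventual fix did.

```python
import os
import queue


class PurgeDispatcher:
    """Bounded hand-off queue between a listener and its worker threads.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, max_pending=1000, exit_fn=os._exit):
        self.pending = queue.Queue(maxsize=max_pending)
        self._exit_fn = exit_fn  # injectable so the logic can be tested

    def submit(self, purge_url):
        """Queue a purge; hard-exit if the workers have stopped draining."""
        try:
            self.pending.put_nowait(purge_url)
            return True
        except queue.Full:
            # A full queue suggests the workers are stuck. Exit immediately,
            # without closing sockets or joining threads, so a supervisor
            # can respawn the process.
            self._exit_fn(1)
            return False
```

The design choice is that `submit` never blocks: the listener either hands the purge off or terminates the process, so it cannot end up waiting on the same lock as the stuck workers.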
Comment 4 Andre Klapper 2013-01-23 07:11:57 UTC
https://gerrit.wikimedia.org/r/#/c/45302/
Comment 5 Tim Starling 2013-01-23 11:24:08 UTC
This should be fixed now. But the CPU usage is very high, so there may be some packet loss.
