Last modified: 2013-04-22 16:14:17 UTC
Sometimes varnishhtcpd stops responding to HTCP requests due to unexplained thread corruption issues. In a recent example, the daemon was logging "Can't call method "accept" on an undefined value at /usr/local/bin/varnishhtcpd line 71" on purge requests. varnishhtcpd spawns worker threads, and apparently, sometimes the workers go on strike, and that's what the picket signs say. ;-) A workaround that Asher proposes is to modify the daemon to kill itself when it gets into that state, which should cause upstart to respawn it. This problem was discovered while fixing the HTCP issues documented in the comments on bug 41130 (late December 2012 comments).
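For illustration only, here is a minimal sketch of the "kill itself so upstart respawns it" idea. It is written in Python rather than as a patch to the actual daemon, and the names (Heartbeat, watchdog, the timeout values) are hypothetical. It assumes workers can call beat() whenever they make progress, and that a hard exit is acceptable because the supervisor restarts the process.

```python
import os
import threading
import time


class Heartbeat:
    """Tracks the last time any worker reported progress (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._last = clock()

    def beat(self):
        # Workers call this after each handled request (or idle wakeup).
        with self._lock:
            self._last = self._clock()

    def stalled(self, timeout):
        # True once no worker has reported in for more than `timeout` seconds.
        with self._lock:
            return self._clock() - self._last > timeout


def watchdog(heartbeat, timeout, interval=1.0, die=os._exit):
    """Poll the heartbeat and hard-exit with status 1 once workers look stuck."""
    while True:
        if heartbeat.stalled(timeout):
            # os._exit skips cleanup, so a wedged thread cannot block the exit.
            die(1)
            return
        time.sleep(interval)
```

In a real daemon the watchdog would run in its own thread, and workers would also need to call beat() when they are merely idle (for example on a queue-read timeout), otherwise a quiet period would look like a stall.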
[Sorry if slightly off-topic] Could we have some sort of monitoring of whether things actually get purged? squid/varnish purging suddenly not working seems to have happened quite a few times in the past (each time for a different reason), and we have no monitoring of it. (We don't even have any unit tests on the MW side for it, as far as I am aware.) This is bad since:
* Most people don't have squid/varnish set up in their dev environment, so people don't notice on local test wikis.
* The symptoms are gradual, and usually pass unnoticed for some time.
* With the exception of images, the people primarily affected are anons, who are less likely to know how to report the issue.
(In reply to comment #1) Moved "we should monitor HTCP purging" to a separate bug: bug 43449.
FYI, more info posted on ops@ by Tim Starling ~6 hours ago: varnishhtcpd daemon (listens on port 4827 for HTCP purges, and converts them to HTTP purges on localhost) deadlocked and stopped working on all upload hosts. Details: "Apparently the worker threads deadlock each other in malloc/realloc. Then the queue overflows and the main thread tries to exit. The main thread closes its HTCP listen socket and then joins in with the deadlock. So it never exits and upstart can't respawn it."
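To make the failure mode above concrete: a graceful exit that waits for worker threads can hang forever if those workers are deadlocked. A minimal Python sketch of the alternative follows, under the assumption that purges go through a bounded queue; enqueue_or_die is a hypothetical name, and this is not claimed to be what the actual fix does.

```python
import os
import queue


def enqueue_or_die(q, item, die=os._exit):
    """Queue a purge; if the queue is full, terminate without cleanup.

    os._exit does not join threads or run atexit handlers, so deadlocked
    workers cannot keep the process alive and the supervisor can respawn it.
    """
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        die(1)
        return False
```

The trade-off is that any purges still sitting in the queue are lost on exit, which is presumably preferable to a daemon that has closed its listen socket but never terminates.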
https://gerrit.wikimedia.org/r/#/c/45302/
This should be fixed now. But the CPU usage is very high, so there may be some packet loss.