Last modified: 2014-11-15 01:21:46 UTC
Today during the deploy, several of the parsoid hosts remained down for quite a long time. Their logs were filled with:

{"name":"parsoid","hostname":"wtp1001","pid":14985,"level":40,"logType":"warning","wiki":"enwiktionary","title":"綉","oldId":26153377,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153377","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153377","time":"2014-11-13T21:15:42.711Z","v":0}
{"name":"parsoid","hostname":"wtp1001","pid":14984,"level":40,"logType":"warning","wiki":"enwiktionary","title":"綊","oldId":17538871,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538871","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538871","time":"2014-11-13T21:15:42.713Z","v":0}
[warning][worker][14995] shutting down
[warning][worker][16672] shutting down
[warning][worker][14987] shutting down
[info][master][14981] shutting down, killing workers
[warning][worker][14996] shutting down
[warning][worker][15004] shutting down
[warning][worker][14990] shutting down
[warning][worker][14998] shutting down
[warning][worker][14988] shutting down
[warning][worker][14992] shutting down
[warning][worker][14985] shutting down
[warning][worker][15011] shutting down
[warning][worker][14984] shutting down
[warning][worker][15001] shutting down
[warning][worker][15007] shutting down
{"name":"parsoid","hostname":"wtp1001","pid":15003,"level":30,"logType":"info","wiki":"rowiki","title":"2012","oldId":8889091,"msg":"completed parsing in 7469 ms","longMsg":"completed parsing in\n7469\nms","time":"2014-11-13T21:15:43.087Z","v":0}
[warning][worker][15003] shutting down
[info][master][14981] exiting
[info][master][16784] initializing 15 workers
[info][worker][16793] loading ...
[info][worker][16790] loading ...
[info][worker][16788] loading ...
[info][worker][16798] loading ...
[info][worker][16795] loading ...
[info][worker][16801] loading ...
[info][worker][16806] loading ...
[info][worker][16793] ready on :8000
[error][worker][16793] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.5:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16790] ready on :8000
[info][worker][16798] ready on :8000
[info][worker][16795] ready on :8000
[error][worker][16790] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.3:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[error][worker][16798] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.7:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16801] ready on :8000
[error][worker][16795] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.6:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[error][worker][16801] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.9:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16806] ready on :8000
[info][worker][16791] loading ...
[error][worker][16806] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.11:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16787] loading ...
[info][worker][16788] ready on :8000
[error][worker][16788] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.2:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16799] loading ...
[info][worker][16804] loading ...
[info][worker][16810] loading ...
[info][worker][16808] loading ...
[info][worker][16812] loading ...
[info][worker][16814] loading ...
[info][worker][16791] ready on :8000
[error][worker][16791] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.4:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16787] ready on :8000
[error][worker][16787] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.1:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[..etc...]

Two things to note here. First, logs from the master aren't going through bunyan, and thus aren't ending up in logstash; there is some discussion of this on https://gerrit.wikimedia.org/r/171618. Second, we need to use SO_REUSEADDR when we open port 8000 so that we don't have to wait for the full linger time to bring up the service after a deploy. See http://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t
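As a possible stopgap on the Parsoid side, a worker could also retry the bind for a bounded time instead of exiting immediately on EADDRINUSE. Purely an illustrative sketch, not the actual ParsoidService.js code (the request handler below is a stand-in for the real one):

    var http = require('http');

    // Stand-in request handler; the real one lives in ParsoidService.js.
    var server = http.createServer(function(req, res) {
        res.end('ok');
    });

    function listenWithRetry(retriesLeft) {
        server.once('error', function(err) {
            if (err.code === 'EADDRINUSE' && retriesLeft > 0) {
                // The port is still held, presumably by the previous instance;
                // back off briefly and try the bind again.
                setTimeout(function() {
                    listenWithRetry(retriesLeft - 1);
                }, 1000);
            } else {
                throw err;
            }
        });
        server.listen(8000);
    }

    listenWithRetry(30);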
Filed bug 73396 for the "logs don't make it to logstash" issue.
Node's docs claim that it already uses SO_REUSEADDR (http://nodejs.org/api/net.html#net_server_listen_port_host_backlog_callback), which makes our EADDRINUSE errors very mysterious. However, the 'cluster' module does its own magic with listen sockets (http://nodejs.org/api/cluster.html#cluster_how_it_works); it's possible that's what's going wrong here, somehow?
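For reference, this is roughly the pattern the cluster docs describe (a minimal sketch, not Parsoid's actual setup): the master forks the workers, and when a worker calls listen() it asks the master for a shared handle rather than binding the port itself, so whatever SO_REUSEADDR handling there is applies to the master's listening socket.

    var cluster = require('cluster');
    var http = require('http');

    if (cluster.isMaster) {
        // The master forks the workers and ends up owning the listening
        // handle that the workers share.
        for (var i = 0; i < 15; i++) {
            cluster.fork();
        }
    } else {
        // listen(8000) in a worker requests the shared handle from the
        // master instead of binding the port itself.
        http.createServer(function(req, res) {
            res.end('ok');
        }).listen(8000);
    }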
Given that we have had this code for a long time, that we have had a couple of deploys since the node 0.10 upgrade, and that this is the first deploy after we enabled timeouts, I would say this is related to the timeouts (5 mins).
Possibly we're not shutting down cleanly when the service is stopped, and the zombie process continues holding on to the port until the timeout expires.
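If so, worker-side cleanup along these lines would release the port as soon as the master disconnects the workers. This is just a hypothetical sketch of the idea, not Parsoid's actual shutdown code:

    var cluster = require('cluster');
    var http = require('http');

    if (cluster.isWorker) {
        var server = http.createServer(function(req, res) {
            res.end('ok'); // stand-in for the real request handler
        });
        server.listen(8000);

        // When the master disconnects the worker (e.g. during
        // 'service parsoid stop'), stop accepting connections and exit once
        // in-flight requests have finished, so the shared listening handle
        // is released promptly instead of lingering in a zombie process.
        process.on('disconnect', function() {
            server.close(function() {
                process.exit(0);
            });
        });
    }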
So, this is the code that we use to cleanly terminate the workers (see api/server.js):

    cluster.disconnect(function() {
        logger.log("info", "exiting");
        process.exit(0);
    });

This waits for the workers to shut down. However, in the common case the title should get processed fairly quickly and clear the timeout, so all the workers should shut down in a timely fashion. I don't think (or at least, I would be surprised if) all of those were hung processes that really do take 5 mins and eventually get killed by the timeout. So, one question I now have is whether we are holding onto timeouts after request processing and not clearing them out cleanly in all cases.
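For concreteness, the pattern I'm wondering about would look something like this (purely hypothetical, not the actual request-handling code; parse() is a stand-in): a per-request watchdog timer that only gets cleared on some code paths. Any timer that is never cleared keeps the worker's event loop alive for the full 5 minutes, which would delay the worker shutdown described above.

    function handleRequest(req, res) {
        // Watchdog for the 5-minute request timeout.
        var watchdog = setTimeout(function() {
            res.writeHead(504);
            res.end('Request timed out');
        }, 5 * 60 * 1000);

        // parse() is a stand-in for the actual parsing pipeline.
        parse(req, function(err, html) {
            // If this clearTimeout is skipped on any path (an early return,
            // an exception thrown before we get here, etc.), the pending
            // timer keeps the worker alive long after it was disconnected.
            clearTimeout(watchdog);
            if (err) {
                res.writeHead(500);
                res.end(err.message);
            } else {
                res.end(html);
            }
        });
    }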
Well, it was 3-4 of the parsoid hosts that had this problem, so it's possible these were hung processes that needed to wait for the timeout. But in that case I would expect 'service parsoid stop' not to actually complete until the timeout had expired and the process had actually stopped. The problem was that the service was restarted while the old parsoid was still running (or so it would seem).
Ah, if it was 3-4 hosts then maybe that is what it is. Ping gwicke about the service restart part, then; he might have a clue what is going on.