Last modified: 2014-09-23 08:59:59 UTC
[13:29] <gwicke> we got an early timeout from salt during the parsoid deploy, which seems to be a salt misconfiguration [13:31] <gwicke> I see cmd = ("sudo salt-call -l quiet --out json publish.runner " [13:31] <gwicke> "deploy.restart '{0}','{1}'") [13:32] <gwicke> in /usr/local/bin/service-restart [13:32] <gwicke> on tin [13:34] <gwicke> from http://docs.saltstack.com/ref/cli/salt.html: -t TIMEOUT, default 5 [13:35] <gwicke> I think I got the timeout around 5 seconds [13:36] <gwicke> parsoid needs up to about 90 seconds for a graceful restart
From Ryan's main: TL;DR: Ensure batching doesn't occur by setting the batch value to 100%: service-restart --repo 'parsoid/deploy' --batch='100%' Long version: I spent some time tracking this down tonight. From the master the following commands work reliably: 1: salt -b '10%' -G 'deployment_target:parsoid' service.restart parsoid 2: salt -b '10%' -G 'deployment_target:parsoid' deploy.restart 'parsoid/deploy' 3: salt-run deploy.restart 'parsoid/deploy' '10%' #2 is a wrapper function for service.restart that maps a repo to a service, then restarts it (it has other future uses that make it necessary as well). #3 is a runner that can be easily referenced from peers like tin to make handling security easier, it basically just calls #2. From tin we call: sudo salt-call -l quiet --out json publish.runner deploy.restart 'parsoid/deploy','10%' This calls #3 on the master via salt's publication system. The problem here is that publish.runner has a default timeout. When that timeout is reached, the runner stops executing. Part of the issue is that it times out at all. The other part of this issue is that this command has no timeout argument: <http://docs.saltstack.com/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner>. Since the command is being run as a batch, it splits the list of minions up into chunks and calls salt deploy.restart on each of them. If the sum of the time of all restarts is greater than the timeout of publish.runner any minions that didn't get called never do. Since we currently have a small number of minions this tends to often result in a single minion being left out. If we increased the number of minions, this would result in many more being left out. I've opened a bug for this: https://github.com/saltstack/salt/issues/10814 I've also added a pull request to allow a timeout argument for publish.runner: https://github.com/saltstack/salt/pull/10815 In the meantime it's possible to workaround this issue by simply ensuring batching doesn't occur by setting the batch value to 100%: service-restart --repo 'parsoid/deploy' --batch='100%' I'm assuming that since parsoid is doing graceful restarts that this shouldn't be a problem.
My reply: Without batching this is taking down the Parsoid cluster for at least a few seconds: a) Parsoid stops accepting new requests while restarting. This typically takes a few seconds, but can take up to 60 seconds if all workers are busy. b) After the restart the first request per worker will be slightly slower as the JIT is doing its thing. With Varnish backend retries some of this might not result in client-visible failures, although I am not sure that Varnish retries non-idempotent POSTs. That would mean that page saves are failing. I think I'll rather (ask a root to) do a dsh -g parsoid service parsoid restart instead until the salt bug is fixed.
A root could also run this on the salt master: salt-run deploy.restart 'parsoid/deploy' '10%'
(In reply to Ryan Lane from comment #3) > A root could also run this on the salt master: > > salt-run deploy.restart 'parsoid/deploy' '10%' Will that avoid the timeout?
Yes. The bug is calling the runner from a peer. If it's called on the master directly it works as expected.
Any of the three commands listed in comment 1 will work if run from the master.
(In reply to Ryan Lane from comment #3) > A root could also run this on the salt master: > > salt-run deploy.restart 'parsoid/deploy' '10%' to confirm, I ran that yesterday as root, got a "completed" after a little while and per Gabriel the result was good. 21:07 mutante: restarting parsoid with salt-run deploy.restart 'parsoid/deploy' '10%'
Ryan Lane wrote: > The problem here is that publish.runner has a default timeout. When that > timeout is reached, the runner stops executing. Part of the issue is that it > times out at all. The other part of this issue is that this command has no > timeout argument: > <http://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner>. <snip> > I've also added a pull request to allow a timeout argument for > publish.runner: > > https://github.com/saltstack/salt/pull/10815 Ryan patch to add a timeout to salt.modules.publish.runner has been merged (as commit 72dfd80e in https://github.com/saltstack/salt/ ). It is included in: $ git tag --contains 72dfd80e v2014.7 v2014.7.0rc1 v2014.7.0rc2 $