Last modified: 2014-09-18 20:21:20 UTC
After successfully provisioning the labs-vagrant instance flow-tests.wmflabs (see bug 70967), I rebooted it. It came up OK but terribly slow. `top` showed puppet agent at 99% of CPU for minutes. I killed it, then the next cron.d run also chewed up CPU. I killed that and ran the puppet agent command line from a terminal; it gets so far as Info: Applying configuration version '1410992606' Notice: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]/returns: executed successfully and never prints anything more, while at 99% of CPU. strace shows it's endlessly stat()ing the same set of 18 paths for lib/puppet/type/git::clone.rb. It stat()s "/var/lib/puppet/lib/puppet/type/git::clone.rb" twice for each time it stat()s the other 17 paths, such as "/usr/lib/ruby/1.9.1/puppet/type/git::clone.rb", "/usr/lib/ruby/1.9.1/x86_64-linux/puppet/type/git::clone.rb", etc.
$ sudo timeout -k 300 1800 puppet agent --onetime --debug --verbose --no-daemonize --no-splay --show_diff 2>&1 | tee /tmp/puppet.out shows it gets to Debug: /Stage[main]/Labs_lvm/Exec[create-volume-group]/unless: vd:r/w:772:-1:0:1:1:-1:0:1:1:8970240:4096:2190:2190:0:DE0xG4-xoJp-kANc-YU6o-RPD3-BqFG-UHi7RW Debug: Failed to load library 'msgpack' for feature 'msgpack' Debug: file_metadata supports formats: pson b64_zlib_yaml yaml raw Debug: Exec[migrate legacy files](provider=posix): Executing check '/usr/bin/test -d /mnt/vagrant' Debug: Executing '/usr/bin/test -d /mnt/vagrant' and then no more output and puppet at 99% of CPU.
I'm pretty sure that "Debug: Executing '/usr/bin/test -d /mnt/vagrant'" is not what's really running while the system goes nuts. I noticed the exact same behavior on sul-test.eqiad.wmflabs (strace full of puppet looking for puppet/type/git::clone.rb in every location that ruby might be storing a puppet type plugin. I let my process run with debug logging enabled and after a very long wait the next thing it spit out was: Debug: Exec[diamond_sudo_for_puppet_sudo_linting](provider=posix): Executing check 'test -e /etc/sudoers.d/50_diamond_sudo_for_puppet && /usr/sbin/visudo -cf /etc/s udoers.d/50_diamond_sudo_for_puppet || exit 0' Here's a bit more context around the pause: Debug: Failed to load library 'msgpack' for feature 'msgpack' Debug: file_metadata supports formats: pson b64_zlib_yaml yaml raw Debug: Exec[migrate legacy files](provider=posix): Executing check '/usr/bin/test -d /mnt/vagrant' Debug: Executing '/usr/bin/test -d /mnt/vagrant' # # ... pauses here for at least 10 minutes, possibly longer # Debug: Exec[diamond_sudo_for_puppet_sudo_linting](provider=posix): Executing check 'test -e /etc/sudoers.d/50_diamond_sudo_for_puppet && /usr/sbin/visudo -cf /etc/s udoers.d/50_diamond_sudo_for_puppet || exit 0' Debug: Executing 'test -e /etc/sudoers.d/50_diamond_sudo_for_puppet && /usr/sbin/v isudo -cf /etc/sudoers.d/50_diamond_sudo_for_puppet || exit 0' Debug: /Stage[main]/Role::Labs::Instance/Diamond::Collector::Minimalpuppetagent[mi nimal-puppet-agent]/Admin::Sudo[diamond_sudo_for_puppet]/Exec[diamond_sudo_for_pup pet_sudo_linting]/unless: /etc/sudoers.d/50_diamond_sudo_for_puppet: parsed OK Debug: Failed to load library 'msgpack' for feature 'msgpack' Debug: file_metadata supports formats: pson b64_zlib_yaml yaml raw Debug: Failed to load library 'msgpack' for feature 'msgpack' This also seems like a red herring. I'll upload the full --debug log so others can take a look at it.
Created attachment 16509 [details] puppet agent --debug output Output of: sudo puppet agent --onetime --verbose --no-daemonize --splay --splaylimit 59 --show_diff --debug On sul-test.eqiad.wmflabs ending near Thu Sep 18 01:14:52 UTC 2014
Possibly related to upstream puppet bug <https://tickets.puppetlabs.com/browse/PUP-2924>
I ran another test with this command: $ TZ=UTC strace /usr/bin/ruby /usr/bin/puppet agent --onetime --verbose --no-daemonize --no-splay --debug 2>&1 | tee /tmp/loud-puppet-strace.log This is nuts: $ grep stat\( loud-puppet-strace.log | grep git::clone | wc -l 2293677 $ grep stat\( loud-puppet-strace.log | grep :: | grep .rb | wc -l 2329047 $ grep stat\( loud-puppet-strace.log | grep :: | sed -n 's/^.*\/type\/\([^"]*\).*/\1/p' | sort | uniq -c 567 admin::sudo.rb 162 admissible_admin::sudo.rb 972 admissible_apt::conf.rb 324 admissible_apt::pin.rb 162 admissible_apt::puppet.rb 162 admissible_apt::repository.rb 324 admissible_base::syslogs::syslogs::readable.rb 162 admissible_diamond::collector::minimalpuppetagent.rb 648 admissible_diamond::collector.rb 162 admissible_ferm::rule.rb 162 admissible_generic::upstart_job.rb 162 admissible_git::clone.rb 162 admissible_labs_lvm::volume.rb 1134 admissible_nrpe::check.rb 1134 admissible_nrpe::monitor_service.rb 162 admissible_rsyslog::conf.rb 324 admissible_salt::grain.rb 324 admissible_sysctl::conffile.rb 324 admissible_sysctl::parameters.rb 3402 apt::conf.rb 918 apt::pin.rb 891 apt::puppet.rb 729 apt::repository.rb 702 base::syslogs::syslogs::readable.rb 162 completed_admin::sudo.rb 972 completed_apt::conf.rb 324 completed_apt::pin.rb 162 completed_apt::puppet.rb 162 completed_apt::repository.rb 324 completed_base::syslogs::syslogs::readable.rb 162 completed_diamond::collector::minimalpuppetagent.rb 648 completed_diamond::collector.rb 162 completed_ferm::rule.rb 162 completed_generic::upstart_job.rb 162 completed_git::clone.rb 162 completed_labs_lvm::volume.rb 1134 completed_nrpe::check.rb 1134 completed_nrpe::monitor_service.rb 162 completed_rsyslog::conf.rb 324 completed_salt::grain.rb 324 completed_sysctl::conffile.rb 324 completed_sysctl::parameters.rb 513 diamond::collector::minimalpuppetagent.rb 1782 diamond::collector.rb 459 ferm::rule.rb 405 generic::upstart_job.rb 2293353 git::clone.rb 756 labs_lvm::volume.rb 3024 nrpe::check.rb 4347 nrpe::monitor_service.rb 432 rsyslog::conf.rb 1107 salt::grain.rb 864 sysctl::conffile.rb 864 sysctl::parameters.rb
Another debug run done with: $ TZ=UTC /usr/bin/ruby /usr/bin/puppet agent --onetime --verbose --no-daemonize --no-splay --debug --trace --evaltrace --noop 2>&1 | tee /tmp/puppet-noop.log This one shows the long pause where nothing is logged (which lines up with the massive number of stat calls in other runs) as happening between this block: ^[[0;32mInfo: Git::Clone[vagrant]: Starting to evaluate the resource^[[0m ^[[0;32mInfo: Git::Clone[vagrant]: Evaluated in 0.01 seconds^[[0m And this block: ^[[0;32mInfo: /Stage[main]/Labs_vagrant/File[/srv/vagrant]: Starting to evaluate the resource^[[0m ^[[0;32mInfo: /Stage[main]/Labs_vagrant/File[/srv/vagrant]: Evaluated in 0.00 seconds^[[0m This is potentially more revealing of what is going on. The File[/srv/vagrant] resource does a recursive chown on the directory where MWV has been checked out. It also requires Git::Clone['vagrant']. My new guess at what is happening is that under the hood puppet creates a separate file resource for each file that is found in the recursive directory search. Each of these in turn inherits the require constraint. As puppet augments it's DAG with these nodes each one tries to figure out if Git::Clone is a type which leads to the insane number of stat calls. I think there are a couple of things that could be done here (besides some upstream patches to be smarter about things in general), the require on the file could be switched to be a before on the clone. My guess is that will make a big difference in the number of stat calls. The other additional thing that could be done would be to change the recursive ownership management to an exec of a find command to switches the ownership of files under the target directory. Having seen what this seems to be doing I think that would actually end up being quite a bit faster than letting puppet make a ton of ruby objects (1 for each file?) and evaluate them in series.
Change 161276 had a related patch set uploaded by BryanDavis: Stop making so many virtual resources https://gerrit.wikimedia.org/r/161276
Change 161276 merged by Andrew Bogott: Stop making so many virtual resources https://gerrit.wikimedia.org/r/161276
(In reply to Bryan Davis from comment #4) > Possibly related to upstream puppet bug > <https://tickets.puppetlabs.com/browse/PUP-2924> The upstream bug was supposedly fixed with <https://github.com/puppetlabs/puppet/pull/2979> but still the behavior we used was silly.
After this patch landed, /var/log/puppet.log says: Notice: Finished catalog run in 60.99 seconds Before that every puppet run from cron was killed with "Notice: Caught TERM; calling stop" before actually finishing.