Last modified: 2014-05-23 18:54:46 UTC
Until we fix bug #61102, I have installed a script, /home/scfc/bin/cleanup-php-cgis, via crontab on tools-login to kill orphaned php-cgi processes on tools-webgrid-01 and tools-webgrid-02. During its development on April 27th I had started a faulty version of it that called "sudo kill -HUP" ad infinitum on the webnodes even when there were no php-cgi processes to kill, adding about 4 KByte/s to /var/log/auth.log and thus filling up /var. The correct version installed via crontab only logs about 1 KByte every 5 minutes (the ssh connection from tools-login to tools-webgrid-01/tools-webgrid-02).

There was a moment when I could have noticed the error: my installed script sometimes complained about processes disappearing between detection and killing. I assumed that was the occasional regular php-cgi shutdown, but in reality it was apparently just a race condition between the two competing scripts.

I've inspected tools-login, tools-webgrid-01 and tools-webgrid-02 for any leftover old processes, and there are now none. I also moved /var/log/auth.log to /data/project/admin/auth.log.scfc.bz2 and ran "stop rsyslogd && start rsyslogd" to get tools-webgrid-01 going again. /var/log/auth.log would normally be kept for about four weeks, so I'll leave this bug open to either remove /data/project/admin/auth.log.scfc.bz2 in a month or fold it back into the logrotate process in two weeks, when it would normally have been compressed as well.
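For illustration only, the guard that the faulty version lacked might look roughly like the sketch below. This is not the actual /home/scfc/bin/cleanup-php-cgis; the function names are hypothetical, and it assumes that "orphaned" means a php-cgi process that has been re-parented to PID 1. The point is simply that no "sudo kill -HUP" (and therefore no auth.log entry) is produced when the PID list is empty.

```shell
#!/bin/sh
# Hypothetical sketch; not the actual cleanup-php-cgis script.

# Read "PID PPID COMMAND" lines on stdin (for example from
# `ps -eo pid=,ppid=,comm=`) and print the PIDs of php-cgi
# processes whose parent is PID 1.
# Assumption: "orphaned" means re-parented to init.
orphaned_php_cgi_pids() {
    awk '$2 == 1 && $3 == "php-cgi" { print $1 }'
}

# Print the kill command for the given PIDs. With no PIDs, print
# nothing and return 1, so the caller never invokes sudo needlessly.
kill_command_for() {
    [ "$#" -gt 0 ] || return 1
    echo "sudo kill -HUP $*"
}

# Example with canned ps output instead of a live webnode:
sample='101 1 php-cgi
102 900 php-cgi
103 1 sshd
104 1 php-cgi'

pids=$(printf '%s\n' "$sample" | orphaned_php_cgi_pids)
# $pids is intentionally unquoted so it splits into separate arguments.
if cmd=$(kill_command_for $pids); then
    echo "would run: $cmd"
else
    echo "nothing to kill"
fi
```

With the sample input this prints the command for PIDs 101 and 104 only; with no matching processes it prints "nothing to kill" and never reaches sudo.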
I've now moved auth.log.4.gz to auth.log.5.gz and /data/project/admin/auth.log.scfc.bz2 (re-compressed) to auth.log.4.gz.