Last modified: 2014-04-24 13:47:29 UTC
"webservice stop" stops the lighttpd processes, but not the php-cgi processes. If the lighttpd is restarted on a different webgrid host, these just become zombies. More importantly, if lighttpd is restarted on the same webgrid host, they probably retain their original environment, so changes in the configuration might not affect them.
*** Bug 64095 has been marked as a duplicate of this bug. ***
Idea: Replace "qdel -j $job" with "ssh $WEBGRIDHOST 'kill -TERM $(cat /var/run/lighttpd/wikilint.pid)'". This will make lighttpd shut down in an orderly fashion, taking the php-cgi processes with it (and even offers a model for graceful shutdowns with "kill -INT").
I asked on users@gridengine.org (cf. http://permalink.gmane.org/gmane.comp.clustering.opengridengine.user/7487) how to find out the host a job is running on, but didn't get an "easy" answer (yet).
Working:
qstat -xml | xmllint --xpath "substring-after(/job_info/queue_info/job_list[@state = 'running' and JB_name = 'lighttpd-wikilint']/queue_name/text(), '@')" -
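The idea above could be wired together roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, the XPath uses string() plus a shell parameter expansion instead of the substring-after() call quoted above, and it assumes that the job for a tool is named "lighttpd-<tool>", that the pid file lives at /var/run/lighttpd/<tool>.pid, and that ssh access to the webgrid host works without a password.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; job name and pid file
# layout are taken from the comment above.

# Reduce a queue instance name of the form "queue@host" to the host part.
host_from_queue_instance() {
    printf '%s\n' "${1#*@}"
}

# Print the host on which the running job named $1 is scheduled.
# Prints an empty line if no such running job is found.
webgrid_host_for_job() {
    queue_instance=$(qstat -xml | xmllint --xpath \
        "string(/job_info/queue_info/job_list[@state = 'running' and JB_name = '$1']/queue_name)" -)
    host_from_queue_instance "$queue_instance"
}

# Send SIGTERM to the lighttpd of tool $1 on whatever host it runs on.
stop_lighttpd() {
    host=$(webgrid_host_for_job "lighttpd-$1") || return 1
    [ -n "$host" ] || return 1
    # \$( ... ) is escaped so that the pid file is read on the remote host.
    ssh "$host" "kill -TERM \$(cat /var/run/lighttpd/$1.pid)"
}
```

With these definitions, "stop_lighttpd wikilint" would stand in for the "qdel -j $job" call.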
The problem with this approach is that it depends on "webservice stop" being the only way a job is ever killed. If, for example, the grid moved the job to another host, it would still just use SIGKILL, and we would be back at square one. So the sensible solution is to use "qsub -notify" with a suitable set of signals and timeouts.
*** This bug has been marked as a duplicate of bug 61102 ***
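For illustration, the "qsub -notify" approach could look roughly like the wrapper below. It is a sketch, not the fix that was eventually deployed: the function names and the lighttpd config path are made up. It relies on Grid Engine sending SIGUSR2 to a job submitted with -notify some time before the SIGKILL (the delay is the queue's "notify" setting), and on the observation above that lighttpd takes its php-cgi processes with it on SIGTERM.

```shell
#!/bin/sh
# Sketch of a job wrapper, to be submitted with something like
#   qsub -notify -N lighttpd-wikilint wrapper.sh
# (names and paths here are placeholders).

child=

# Pass SIGTERM on to the child, if one has been started.
forward_term() {
    if [ -n "$child" ]; then
        kill -TERM "$child" 2>/dev/null || :
    fi
}

# Run "$@" in the background and turn an incoming SIGUSR2 into a SIGTERM
# for it, so it can shut down in an orderly fashion before SIGKILL arrives.
run_with_notify() {
    trap forward_term USR2
    "$@" &
    child=$!
    # wait returns early when the trapped signal arrives ...
    wait "$child" || :
    # ... so wait once more for the child to actually exit.
    wait "$child" 2>/dev/null || :
    child=
    trap - USR2
}

# Hypothetical use at the end of the job script (-D keeps lighttpd in the
# foreground so that the wrapper can wait for it):
#   run_with_notify lighttpd -D -f "$HOME/.lighttpd.conf"
```

The queue's "notify" interval has to be long enough for lighttpd to finish shutting down; otherwise the SIGKILL still arrives first.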