Last modified: 2014-08-27 23:12:04 UTC
At the moment, qdel KILLs the job, which is a bit rude. If jsub called "qsub -notify", SGE would send the job a warning signal before KILLing it. That signal is set by NOTIFY_KILL in "execd_params"; the default is SIGUSR2, but I would favour SIGTERM (or a SIGHUP -> SIGINT -> SIGTERM cascade), as I suppose more programs already have a suitable handler for it. The queue parameter "notify" defines the interval between signals; given that many jobs in Tools use database and other network connections, I would be fairly generous here and propose 60 s. In the worst case of a SIGHUP -> SIGINT -> SIGTERM -> SIGKILL cascade that means 180 s, which I find acceptable; for special cases, root users can always log into the exec node and kill at will.
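For illustration, here is a minimal sketch of what a job could do with such a warning signal, assuming the grid delivers SIGTERM ahead of the final SIGKILL. The names GracefulShutdown and run_job are hypothetical and not part of jsub or SGE; the point is only that the job gets to finish its current unit of work and close its connections within the notify interval.

```python
import signal


class GracefulShutdown:
    """Remembers that a termination signal arrived so the main loop can stop cleanly."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.requested = False
        self.received = None
        # Assumption: the grid is configured to send one of these signals
        # before the final SIGKILL.
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set flags here; the actual cleanup happens in the main loop.
        self.requested = True
        self.received = signum


def run_job(shutdown, work_items, process):
    """Process items until they run out or a shutdown was requested.

    Returns the number of items that were fully processed.
    """
    done = 0
    for item in work_items:
        if shutdown.requested:
            break  # stop between items, e.g. before opening a new DB transaction
        process(item)
        done += 1
    return done
```

A job would create one GracefulShutdown at startup, pass it to its main loop, and close database or network connections once run_job returns.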
*** Bug 63878 has been marked as a duplicate of this bug. ***
We need to use "qsub -notify" in webservice as well.
Concerning (non-)termination of php-cgi processes: http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModFastCGI documents an option "kill-signal" in the .lighttpd fcgi settings: "By default lighttpd send SIGTERM to FastCGI processes, which were spawned by lighttpd. Applications, which link libfcgi, need to be killed with SIGUSR1. This applies to php <5.2.1, lua-magnet and others." I tried setting this value to 9 (SIGKILL) and also to 1 (SIGHUP). In neither case was the signal forwarded to the spawned cgi processes, while killing them by hand with 9 and 1 worked. This (mis)behaviour also seems to matter in the case of overloaded and dying webservices, as overloaded threads/processes are /not/ terminated as they should be.
The program flow is different at the moment: On qdel, SGE kills the master lighttpd process with SIGKILL. Thus, lighttpd never has a chance to kill the php-cgi processes. So kill-signal is irrelevant at the moment.
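To make the difference concrete: SIGKILL cannot be caught, so a parent killed that way has no opportunity to clean up its children, whereas a parent that receives SIGTERM can pass it on first. The following is a rough sketch of that pattern and not lighttpd's actual implementation; spawn_children, terminate_children and install_forwarder are hypothetical helpers, and the sleeping child processes merely stand in for php-cgi workers.

```python
import signal
import subprocess
import sys


def spawn_children(n):
    """Start n placeholder worker processes (stand-ins for php-cgi)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(n)]


def terminate_children(children, timeout=5.0):
    """Send SIGTERM to every running child, escalating to SIGKILL on timeout.

    Returns the list of child return codes.
    """
    for child in children:
        if child.poll() is None:
            child.terminate()  # SIGTERM on POSIX systems
    codes = []
    for child in children:
        try:
            codes.append(child.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            child.kill()  # last resort: SIGKILL
            codes.append(child.wait())
    return codes


def install_forwarder(children):
    """Make the parent clean up its children when it receives SIGTERM.

    A handler like this never runs if the parent is killed with SIGKILL,
    because SIGKILL cannot be caught.
    """
    def handler(signum, frame):
        terminate_children(children)
        sys.exit(0)

    signal.signal(signal.SIGTERM, handler)
```

With a handler like this installed, a warning SIGTERM delivered via "qsub -notify" would give the master process a chance to shut its workers down before the final SIGKILL arrives.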
The grid has been adjusted to use SIGTERM by default now; this problem should be solved.