Last modified: 2014-11-01 15:41:16 UTC
From time to time, some subsets of jobs are no longer being executed. Zuul does enqueue them properly, as can be seen on https://integration.wikimedia.org/zuul/ when the issue occurs. The Jenkins queue is idle, with the target hosts not running any tests.

An example of a stuck job:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test 2 0 14
build:integration-jjb-config-test:contintLabsSlave 0 0 14
$

The numbers are Total, Running, Workers. The status page shows two jobs being stuck.

Another occurrence:

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 17 0 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
$

And there are indeed 17 such jobs stuck.

Suspicion: both jobs are tied to the node label contintLabsSlave. Zuul apparently asked to run the labelless function, which was properly enqueued by the Gearman server. Since the job has a label, the labelless function is never processed by the Jenkins Gearman plugin.
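The stuck condition described above (jobs queued, none running, workers registered) can be checked mechanically. The following is a minimal sketch in Python, assuming the `status` output has four whitespace-separated columns (function name, Total, Running, Workers) as in the output quoted above. The names `parse_status` and `find_stuck_functions` are hypothetical helpers, not part of Zuul or python-gear.

```python
def parse_status(text):
    """Parse Gearman admin `status` output into {function: (total, running, workers)}."""
    functions = {}
    for line in text.splitlines():
        line = line.strip()
        # The admin protocol ends the listing with a line holding a single ".".
        if not line or line == ".":
            continue
        parts = line.split()
        if len(parts) != 4:
            continue  # not a status row, ignore it
        name, total, running, workers = parts
        functions[name] = (int(total), int(running), int(workers))
    return functions


def find_stuck_functions(text):
    """Return functions that have queued jobs, none running, and workers registered."""
    return sorted(
        name
        for name, (total, running, workers) in parse_status(text).items()
        if total > 0 and running == 0 and workers > 0
    )


# Sample taken from the status output quoted in this report.
SAMPLE = """\
build:integration-jjb-config-test\t2\t0\t14
build:integration-jjb-config-test:contintLabsSlave\t0\t0\t14
.
"""

print(find_stuck_functions(SAMPLE))
```

A function can legitimately show queued jobs with none running for a brief moment, so a match is only suspicious when it persists across several successive status queries.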
Once slaves are disconnected I get:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave 0 0 0
build:integration-jjb-config-test 2 0 0
$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 22 0 0
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 0

It did process a few jobs but got stuck again:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave 0 0 14
build:integration-jjb-config-test 2 0 14
$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 16 0 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
Disconnecting and reconnecting the Gearman client does unleash a few jobs. Disconnecting and reconnecting a slave unleashes them as well. Here is the output each time I disconnected and reconnected integration-slave1002.eqiad.wmflabs:

hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 12 2 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 11 1 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 10 0 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 10 0 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 9 2 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$

It eventually managed to run them all.
I have upgraded Zuul from wmf-deploy-20140122 to wmf-deploy-20140416-3. That might fix it.
We got python-gear upgraded from 0.4.0 to 0.5.4, which fixes a number of function registration errors in Gearman. That might solve the issue.
It seems this is no longer occurring.
That occurred again today around noon UTC. Jenkins/Zuul restarted at around 14:17 UTC :-(
Crashed again on May 28th during the European afternoon. Jobs meant to be run on labs instances ended up no longer being registered with the Zuul Gearman server. That must be a bug in the Jenkins Gearman plugin :-( {{bug|63760}}
Another occurrence:

hashar@gallium:~$ echo status|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave 0 0 10
build:apps-android-wikipedia-maven-checkstyle 10 0 10

The numbers are Total, Running, Workers. And there are indeed workers registered for the function:

hashar@gallium:~$ echo workers|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle|cut -b1-50
54 127.0.0.1 integration-slave1002_exec-3 : build:
53 127.0.0.1 integration-slave1002_exec-1 : build:
55 127.0.0.1 integration-slave1002_exec-4 : build:
56 127.0.0.1 integration-slave1002_exec-0 : build:
57 127.0.0.1 integration-slave1002_exec-2 : build:
14 127.0.0.1 integration-slave1001_exec-0 : build:
19 127.0.0.1 integration-slave1001_exec-3 : build:
21 127.0.0.1 integration-slave1001_exec-4 : build:
22 127.0.0.1 integration-slave1001_exec-2 : build:
28 127.0.0.1 integration-slave1001_exec-1 : build:

The functions registered:

build:apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave

WORKAROUND: disconnect and reconnect the labs slaves.
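To cross-check which executors registered which function, the `workers` output can be grouped per function. This is a rough sketch in Python, assuming each line has the shape `FD IP CLIENT-ID : FUNCTION ...` as in the output above. The helper name `workers_by_function` is hypothetical, and the sample lines are illustrative only: the real output in this report was truncated by `cut -b1-50`, so the function lists below are reconstructed from the two function names given, not copied from a log.

```python
from collections import defaultdict


def workers_by_function(text):
    """Map each function name to the client IDs that registered it."""
    registered = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":
            continue  # blank line or end-of-listing marker
        head, sep, funcs = line.partition(" : ")
        if not sep:
            continue  # connection with no registered functions
        fields = head.split()
        if len(fields) != 3:
            continue  # unexpected shape, skip rather than guess
        client_id = fields[2]
        for function in funcs.split():
            registered[function].append(client_id)
    return dict(registered)


# Illustrative sample, NOT actual log output (the original was truncated).
JOB = "build:apps-android-wikipedia-maven-checkstyle"
SAMPLE_WORKERS = (
    "54 127.0.0.1 integration-slave1002_exec-3 : " + JOB + " " + JOB + ":contintLabsSlave\n"
    "14 127.0.0.1 integration-slave1001_exec-0 : " + JOB + " " + JOB + ":contintLabsSlave\n"
    ".\n"
)

for function, clients in sorted(workers_by_function(SAMPLE_WORKERS).items()):
    print(function, len(clients), clients)
```

If the labelled and labelless functions show different worker counts, that would point at a registration problem rather than a scheduling one.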
Created attachment 15589: Zuul events spike

Earlier this week I noticed Zuul being trapped in some loop. Upstream has noticed it as well from time to time but never managed to track it down. Attached is a graph showing the spike of events on June 6th, which is caused by the death loop.
*** Bug 69045 has been marked as a duplicate of this bug. ***
*** Bug 70256 has been marked as a duplicate of this bug. ***
Documented a workaround on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues

The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around, but TCP connections time out completely and it does not process anything.

The workaround is to disconnect Jenkins from the Gearman server and reconnect it:
* head to https://integration.wikimedia.org/ci/configure, logged in with a WMF LDAP account
* search for "Gearman"
* uncheck "Enable Gearman"
* Save at the bottom
* search for "Gearman"
* check "Enable Gearman"
* Save at the bottom
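Since the deadlock shows up as TCP connections that time out, it can be detected with a probe that sends an admin command and gives up after a deadline. Below is a minimal sketch in Python, roughly equivalent to `echo status|nc -q 2 localhost 4730` with a timeout. The function name `gearman_admin` and the 5 second default are assumptions for illustration; host and port match the commands used earlier in this report.

```python
import socket


def gearman_admin(command, host="localhost", port=4730, timeout=5.0):
    """Send a Gearman admin command and return the reply text.

    Returns None when the server cannot be reached or does not answer
    within `timeout` seconds, which is the symptom of the deadlock.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(command.encode("ascii") + b"\n")
            data = b""
            # `status` and `workers` replies end with a line holding a single ".".
            while not (data == b".\n" or data.endswith(b"\n.\n")):
                chunk = sock.recv(4096)
                if not chunk:
                    break  # server closed the connection
                data += chunk
            return data.decode("utf-8", "replace")
    except OSError:
        # Covers connection refused as well as socket timeouts.
        return None
```

Calling `gearman_admin("status")` on the Zuul host and getting `None` back would suggest the server is hung and that the disconnect and reconnect steps above are needed.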
That is related to bug 63758 (JJB created jobs not registering). I have upgraded the Jenkins Gearman plugin to fix job registration:
* cherry-picked https://review.openstack.org/#/c/125755/ patchset 8
* compiled it via Maven
* uploaded it and restarted Jenkins

That bumps the Gearman plugin to a build with support for the Jenkins LTS version we are using, which is probably going to help.

I found another issue that causes the Gearman server to lock up completely while waiting for data to be received on a socket. Filed upstream as https://bugs.launchpad.net/gear/+bug/1381565
The root cause is that the Gearman server stops responding for an unknown reason. After reconnecting it (see comment #12), the jobs were still stuck in the queue due to a bug in Zuul. That is bug 72113; the patch I wrote is applied on our Zuul and confirmed to work (merge functions are now properly retriggered when Gearman comes back).