Last modified: 2014-08-14 12:20:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T71272, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69272 - Beta cluster job queue not running
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Version: unspecified
Hardware: All All
Importance: Unprioritized critical
Target Milestone: ---
Assigned To: Bryan Davis
Depends on:
Blocks:
Reported: 2014-08-07 22:22 UTC by Jon
Modified: 2014-08-14 12:20 UTC (History)
CC List: 15 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
/etc/jobrunner/jobrunner.conf on deployment-jobrunner01.eqiad.wmflabs (2.54 KB, application/octet-stream)
2014-08-08 14:11 UTC, Antoine "hashar" Musso (WMF)

Description Jon 2014-08-07 22:22:09 UTC
I logged in to beta labs as User:jdlrobson and made this edit:
http://en.wikipedia.beta.wmflabs.org/w/index.php?title=User_talk%3ASelenium_user&diff=119002&oldid=119000

When I log in as Selenium user I do not see a notification for this event.
Comment 1 Kunal Mehta (Legoktm) 2014-08-07 23:56:12 UTC
legoktm@deployment-bastion:~$ mwscript showJobs.php --wiki=enwiki --group
htmlCacheUpdate: 74 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enotifNotify: 41 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchDeletePages: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchLinksUpdatePrioritized: 3104 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
LocalRenameUserJob: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
updateBetaFeaturesUserCounts: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnEdit: 2931 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnDependencyChange: 5676 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
EchoNotificationJob: 1558 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
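Output like the above can be checked mechanically. The following is a minimal Python sketch, assuming the line format is exactly what is quoted in this comment; the helper names `parse_show_jobs` and `stalled_types` are hypothetical and not part of MediaWiki. It flags job types that have a backlog but no claimed jobs. A single snapshot cannot prove that a queue is stalled, so treat the result as a hint only.

```python
import re

# Matches one line of the `showJobs.php --group` output quoted above.
LINE_RE = re.compile(
    r"^(?P<type>\S+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed$"
)


def parse_show_jobs(text):
    """Return {job_type: {"queued": int, ...}} for every line that matches."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            fields = match.groupdict()
            job_type = fields.pop("type")
            stats[job_type] = {name: int(value) for name, value in fields.items()}
    return stats


def stalled_types(stats):
    """Job types with queued jobs but nothing claimed (heuristic, see lead-in)."""
    return sorted(t for t, s in stats.items() if s["queued"] > 0 and s["claimed"] == 0)


# Two lines copied from the output above.
sample = """\
htmlCacheUpdate: 74 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
EchoNotificationJob: 1558 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
"""
stats = parse_show_jobs(sample)
print(stalled_types(stats))
```

Running the same check on two snapshots taken some minutes apart gives a better signal than one snapshot, since a healthy queue can briefly show zero claimed jobs.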
Comment 2 Antoine "hashar" Musso (WMF) 2014-08-08 14:10:29 UTC
There is no component for jobrunner in Bugzilla yet (bug 68318). CCing the authors, Aaron and Ori.
Comment 3 Antoine "hashar" Musso (WMF) 2014-08-08 14:11:20 UTC
Created attachment 16158 [details]
/etc/jobrunner/jobrunner.conf on deployment-jobrunner01.eqiad.wmflabs

We have a jobrunner for generic jobs: deployment-jobrunner01.eqiad.wmflabs which has the puppet class role::beta::jobrunner applied.

I reran puppet on the instance:

 Notice: /Stage[main]/Mediawiki::Jobrunner/Service[jobrunner]/ensure: ensure changed 'stopped' to 'running'
 Info: /Stage[main]/Mediawiki::Jobrunner/Service[jobrunner]: Unscheduling refresh on Service[jobrunner]

But the service does not start:

 # service jobrunner status
 jobrunner stop/waiting
 #


In /var/log/syslog I found:

Aug  8 14:00:52 deployment-jobrunner01 php:

PHP Warning:  syntax error, unexpected '{' in /etc/jobrunner/jobrunner.conf on line 3#012 in 
 /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 128

PHP Fatal error:  Uncaught exception 'Exception' with message
  'Could not parse file at '/etc/jobrunner/jobrunner.conf'.' in
  /srv/deployment/jobrunner/jobrunner/redisJobRunnerService:132#012Stack trace:#012#0
  /srv/deployment/jobrunner/jobrunner/redisJobRunnerService(51): RedisJobRunnerService::init(Array)#012#1 {main}#012
  thrown in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 132


The file /etc/jobrunner/jobrunner.conf is a JSON file managed by puppet, and it is invalid:

 php > $json = file_get_contents('/etc/jobrunner/jobrunner.conf');
 php > var_dump( json_decode( $json ) );
 NULL
 php > 

PHP's json_decode() returns NULL if the JSON cannot be decoded or if the encoded data is deeper than the recursion limit.


I have attached the file.
Comment 4 Antoine "hashar" Musso (WMF) 2014-08-08 14:19:28 UTC
The file has inline comments using //, which PHP's json_decode() does not support. Removing the comments fixes the issue.
Comment 5 Antoine "hashar" Musso (WMF) 2014-08-08 14:24:19 UTC
I believe the new jobrunner service is only used on HHVM. So adding keyword hiphop.
Comment 6 Bryan Davis 2014-08-08 14:48:34 UTC
The deployed version of jobrunner in beta had lagged behind the configuration. Additionally there was a local hotpatch on deployment-jobrunner01 that prevented trebuchet from updating the checkout properly. I fixed these two things and the jobrunner is operational again.

(In reply to Antoine "hashar" Musso from comment #4)
> The file has inline comments using //, which PHP's json_decode() does not
> support. Removing the comments fixes the issue.

This is actually handled in the latest version. Aaron strips the comments before parsing the file as json.
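The thread does not show how the jobrunner strips comments, so the following is only a Python sketch of the general technique, not Aaron's implementation. It removes `//` comments that fall outside string literals and then parses the remainder as JSON. The function names and the sample config are hypothetical.

```python
import json


def strip_line_comments(text):
    """Remove // comments that are outside of JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False          # this character was escaped, keep it as-is
            elif ch == "\\":
                escaped = True           # next character is escaped
            elif ch == '"':
                in_string = False        # closing quote
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line but keep the newline itself,
            # so line numbers in later parse errors still match the file.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_commented_json(text):
    """Parse JSON text that may contain // line comments."""
    return json.loads(strip_line_comments(text))


# Hypothetical sample, not the contents of the real jobrunner.conf.
sample = '''{
    // a comment on its own line
    "url": "http://example.invalid/path", // the slashes inside the string are kept
    "runners": 0
}'''
config = load_commented_json(sample)
print(config)
```

Tracking string state matters here: a naive "delete everything after `//`" would also truncate any value that contains a URL.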
Comment 7 Bryan Davis 2014-08-08 14:52:39 UTC
(In reply to Antoine "hashar" Musso from comment #5)
> I believe the new jobrunner service is only used on HHVM. So adding keyword
> hiphop.

The new jobrunner is actually compatible with both php5 and hhvm. We are running it in production on both interpreters.
Comment 8 Antoine "hashar" Musso (WMF) 2014-08-08 15:14:06 UTC
Excellent. Thank you very much :]
Comment 9 Kunal Mehta (Legoktm) 2014-08-08 15:32:17 UTC
Thanks for looking into this quickly. I still see the same number of jobs (well there are more now...) queued though?

htmlCacheUpdate: 81 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enotifNotify: 43 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchDeletePages: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchLinksUpdatePrioritized: 3170 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
LocalRenameUserJob: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
updateBetaFeaturesUserCounts: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnEdit: 2992 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnDependencyChange: 5791 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
EchoNotificationJob: 1640 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
Comment 10 Bryan Davis 2014-08-08 16:02:32 UTC
It looks like we have a configuration issue. I see `"runners": 0,` for all of the groups in the config file. This is probably a puppet problem.
Comment 11 Antoine "hashar" Musso (WMF) 2014-08-08 16:06:41 UTC
role::beta::jobrunner has:

    class { '::mediawiki::jobrunner':
        aggr_servers  => [ '10.68.16.146' ],
        queue_servers => [ '10.68.16.146' ],
    }


And the puppet class ::mediawiki::jobrunner defaults all of the runner settings to 0 :-]
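A misconfiguration like this can be caught before the service starts. The sketch below, in Python, assumes the parsed config has a top-level `groups` mapping whose entries each carry a `runners` count. Only the `"runners": 0` entries are taken from this thread; the overall shape, the group names, and the function name are hypothetical.

```python
def groups_without_runners(config):
    """Names of groups whose "runners" value is missing or not a positive integer."""
    groups = config.get("groups", {})
    return sorted(
        name
        for name, settings in groups.items()
        if not isinstance(settings.get("runners"), int) or settings["runners"] <= 0
    )


# Hypothetical config shape and group names; only `"runners": 0` comes from the thread.
config = {
    "groups": {
        "basic": {"runners": 0},
        "parsoid": {"runners": 0},
    }
}
idle = groups_without_runners(config)
if idle:
    print("no runners configured for:", ", ".join(idle))
```

A check of this kind would only report the problem; the actual fix in this bug was to set the runner counts in the puppet role.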
Comment 12 Gerrit Notification Bot 2014-08-08 16:16:40 UTC
Change 152931 had a related patch set uploaded by Hashar:
beta: Set runners_* for role::beta::jobrunner

https://gerrit.wikimedia.org/r/152931
Comment 13 Bryan Davis 2014-08-08 16:27:11 UTC
[17:13]  <    bd808>	 legoktm: Can you check the job count again?
[17:13]  <  legoktm>	 yay it's going down!
[17:13]  <  legoktm>	 EchoNotificationJob: 425 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

Patch is in beta via cherry-pick
Comment 14 Aaron Schulz 2014-08-13 20:18:20 UTC
Can this be closed now?
Comment 15 Antoine "hashar" Musso (WMF) 2014-08-14 12:18:42 UTC
deployment-bastion:~$ mwscript showJobs.php --wiki=enwiki --group
cirrusSearchLinksUpdatePrioritized: 0 queued; 3 claimed (0 active, 3 abandoned); 0 delayed
$

I guess it is ok now :)
Comment 16 Gerrit Notification Bot 2014-08-14 12:20:30 UTC
Change 152931 merged by Ori.livneh:
beta: Set runners_* for role::beta::jobrunner

https://gerrit.wikimedia.org/r/152931
