Last modified: 2013-06-10 23:06:13 UTC
Addshore has made a nice graph at http://tools.wmflabs.org/addshore/toolslab/ that shows the number of jobs running on the grid. It would be preferable to have this properly integrated in Ganglia. The "official" contrib repo has SGE functionality at https://github.com/ganglia/gmetric/tree/master/hpc/sge_jobs (cf. also http://comments.gmane.org/gmane.comp.monitoring.ganglia.general/1920). If I understand the Puppet structure correctly, both sge_jobs.sh and jobqueue_report.php need to be puppetized on the Ganglia side, but I'm confused by ganglia and ganglia_new.
1. Apparently, the Puppet modules are structured the other way round: A module typically has a ::monitoring class that adds the gathering thingy to the node. At Tools, the proper class would probably be gridengine::master::monitoring to be deployed exactly once per SGE cluster.
2. No report has to be defined on the Ganglia side at all. If one feeds it data, it will make sense of it on its own.
3. As a test, I have set up ~scfc/bin/sge_jobs.pl to be run on tools-login every fifteen minutes. It gathers information on pending, running and error jobs, and submits it to Ganglia. The graphs can be found at http://ganglia.wmflabs.org -> tools -> tools-login -> sge_pending/sge_running/sge_error. I intend to leave it running for a few days before puppetizing.
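For readers who want to see roughly what such a gathering script does, here is a minimal Python sketch. It is not the actual sge_jobs.pl (which is Perl and not shown in this report). It assumes that `qstat -u '*'` prints one row per job with the state in the fifth column, that job names contain no whitespace, and that gmetric is on the PATH. The function names and the simplified state mapping are made up for illustration; only the metric names are taken from the graphs mentioned above.

```python
import subprocess

# Metric names as they appear in the Ganglia graphs mentioned above.
METRICS = ("sge_pending", "sge_running", "sge_error")


def count_jobs(qstat_output):
    """Count pending, running and error jobs in `qstat -u '*'` output.

    Assumed column order: job-ID prior name user state ...
    """
    counts = {name: 0 for name in METRICS}
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; this skips the header,
        # the dashed separator and blank lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        state = fields[4]
        # Simplified mapping (an assumption): any "E" counts as an error,
        # otherwise "q" means queued/pending and "r" means running.
        if "E" in state:
            counts["sge_error"] += 1
        elif "q" in state:
            counts["sge_pending"] += 1
        elif "r" in state:
            counts["sge_running"] += 1
    return counts


def gmetric_command(name, value):
    """Build the gmetric call that submits one data point."""
    return [
        "gmetric",
        "--name=" + name,
        "--value=" + str(value),
        "--type=uint32",
        "--units=jobs",
    ]


def report():
    """Run qstat once and submit the three counts to Ganglia."""
    result = subprocess.run(
        ["qstat", "-u", "*"], capture_output=True, text=True, check=True
    )
    for name, value in count_jobs(result.stdout).items():
        subprocess.run(gmetric_command(name, value), check=True)
```

Calling `report()` from a cron job on a host that runs gmond would produce one data point per metric on each run. Keeping the parsing in `count_jobs` separate from the two external commands makes it possible to check the counting against saved qstat output without touching the grid or Ganglia.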
Looks lovely, I am guessing it is not exactly an intensive task so could we have it running minutely?
(In reply to comment #2)
> Looks lovely, I am guessing it is not exactly an intensive task so could we
> have it running minutely?

On the grid side, yes, but I don't know how much load this causes on Ganglia (three data points per minute -> 180 per hour -> etc., graph generation may take longer, etc.), so before "turning it to 11" I'd prefer Ryan's okay.
Compared to the amount of data Ganglia already receives about Labs, I don't think it will have much effect :)
This is cool. I have a feeling you can run it more often than that without much impact. A ton of data is already sent to the servers.
(In reply to comment #5)
> This is cool. I have a feeling you can run it more often than that without
> much impact. A ton of data is already sent to the servers.

Okay, increased it to an update every minute.
Related URL: https://gerrit.wikimedia.org/r/64511 (Gerrit Change I48a65620d2fa5ee0fa3d147f9157af60c44c31c3)
Gerrit change #64511 (and the fix in Gerrit change #67899) got merged, so the status is now available at <http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=tools&h=tools-master&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4#mg_SGE_div>. I've removed my cron job on tools-login.