Last modified: 2014-08-14 14:23:38 UTC
When adding reports for all the wiki cohorts and manually kicking off the scheduler, not all reports completed successfully. Of the 900 or so reports, usually only about 80-160 would run before the queue simply stopped processing reports. There were no error messages and no failures. It would basically do this:

* process each recurrent report and create a run for each one (I could see 900+ pending reports in the database, all recurrent runs of the wiki cohort reports)
* presumably create the group / chain celery constructs and execute delay() on the top level chain. No errors were reported here, and we can be sure of this as this section is inside a try block.
* execute some of the reports on the queue. Monitoring the queue log shows this, and I could see errors for things such as mysql being unresponsive, labsdb databases not existing for obscure wikis, etc. But none of those occurred in great number, and the queue seemed to process just fine from there on out.

To me, this means that the error is happening somewhere in "celery land", maybe something to do with the new group / chain addition we made... Either way, this is not an optimization/bugfix that can wait. Without it, wikimetrics simply won't be able to run the recurrent reports for each project as we hoped.
I think I found the problem. Celery chains are not documented very well. Apparently, if one of their children raises an error, the chain can stop: https://github.com/celery/celery/issues/1662 This is not mentioned in the main docs for chain: http://celery.readthedocs.org/en/latest/userguide/canvas.html I'm trying out a patch now but I don't have tests for it yet.
Change 150475 had a related patch set uploaded by Milimetric: Fix report chain stopping https://gerrit.wikimedia.org/r/150475
> Celery chains are not documented very well.
> Apparently, if one of their children raises an error, the chain can stop:

This makes total sense given that chains are made to pass the result of one task to the next.
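To illustrate the point about chains, here is a minimal pure-Python sketch. It does not use Celery and is not the wikimetrics code; it only models the behaviour described above, where each task in a chain receives the previous result, so the first exception halts everything after it. The helper names (`run_chain`, `run_independently`, `make_report`) and the wiki names are hypothetical.

```python
def run_chain(tasks, initial=None):
    """Simplified model of chain semantics (not Celery itself):
    each task gets the previous task's result, and the first
    exception stops every task that comes after it."""
    completed = []
    result = initial
    for task in tasks:
        try:
            result = task(result)
        except Exception as exc:
            # Nothing after the failing task is ever called.
            return completed, exc
        completed.append(task.__name__)
    return completed, None


def run_independently(tasks):
    """Run every task on its own, so one failure cannot block the rest."""
    completed, failed = [], []
    for task in tasks:
        try:
            task(None)
            completed.append(task.__name__)
        except Exception:
            failed.append(task.__name__)
    return completed, failed


def make_report(name, fails=False):
    """Build a hypothetical report task; `fails` simulates a database error."""
    def report(_previous):
        if fails:
            raise RuntimeError(f"{name}: database unavailable")
        return name
    report.__name__ = name
    return report


# Hypothetical reports: the second one fails, as a missing labsdb database might.
reports = [
    make_report("enwiki"),
    make_report("obscurewiki", fails=True),
    make_report("dewiki"),
]

chain_done, chain_error = run_chain(reports)
indep_done, indep_failed = run_independently(reports)

print("chained:", chain_done, "stopped by:", chain_error)
print("independent:", indep_done, "failed:", indep_failed)
```

In this model the chained run never reaches the third report, while the independent run completes it and only records the one failure. That would be consistent with a handful of database errors stalling a much larger set of pending reports.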
Collaborative tasking on etherpad: http://etherpad.wikimedia.org/p/analytics-68840
Change 150475 merged by Milimetric: Removing usage of celery chains from report scheduling https://gerrit.wikimedia.org/r/150475
We removed chains to simplify our code and make it easier to test. The biggest performance gain, however, comes from the migration of the labsdb hosts to MariaDB. https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals#Analytics