Last modified: 2014-08-14 14:23:38 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T70840, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68840 - Wikimetrics can't run a lot of recurrent reports at the same time
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: Wikimetrics
Version: unspecified
Hardware: All  OS: All
Importance: Highest normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Whiteboard: u=AnalyticsEng c=Wikimetrics p=8 s=20...
Depends on:
Blocks: 69252
Reported: 2014-07-30 01:54 UTC by Dan Andreescu
Modified: 2014-08-14 14:23 UTC
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Dan Andreescu 2014-07-30 01:54:13 UTC
When adding reports for all the wiki cohorts and manually kicking off the scheduler, not all reports could successfully complete.  Of the 900 or so reports, usually only about 80-160 would run before the queue would simply stop processing.  There were no error messages and no failures; it would basically do this:

* process each recurrent report and create a run for each one (I could see 900+ pending reports in the database, all recurrent runs of the wiki cohort reports)
* presumably create the celery group / chain constructs and execute delay() on the top-level chain.  No errors were reported here, and we can be sure of this because this section is inside a try block.
* execute some of the reports on the queue.  Monitoring the queue log showed this, and I could see errors for things such as mysql being unresponsive, labsdb databases not existing for obscure wikis, etc.  But none of those occurred in great number, and the queue seemed to process just fine from there on out.

To me, this means that the error is happening somewhere in "celery land", maybe something to do with the new group / chain addition we made...

Either way, this is not an optimization/bugfix that can wait.  Without it, wikimetrics simply won't be able to run the recurrent reports for each project as we hoped.
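The steps above can be modeled in plain Python (no celery required); the names here are illustrative, not Wikimetrics' real identifiers. In the actual code, step 2 builds celery group / chain constructs and calls delay() inside a try block; the expectation is that step 3 tolerates individual failures:

```python
# Hypothetical sketch of the intended scheduler behavior; function and
# field names are assumptions, not Wikimetrics' actual code.

def schedule_and_run(reports, execute):
    # Step 1: create a pending run for every recurrent report.
    runs = [{"report": r, "status": "pending"} for r in reports]

    # Steps 2-3: execute each run; individual failures (unresponsive
    # mysql, missing labsdb databases, ...) are recorded but should not
    # stop the queue from processing the remaining runs.
    for run in runs:
        try:
            execute(run["report"])
            run["status"] = "done"
        except Exception:
            run["status"] = "failed"
    return runs

def fake_execute(report):
    # Simulate one obscure wiki whose labsdb database does not exist.
    if report == "obscurewiki":
        raise RuntimeError("labsdb database does not exist")

runs = schedule_and_run(["enwiki", "obscurewiki", "dewiki"], fake_execute)
# expected: done, failed, done -- one failure does not stall the queue
```

The bug is that the real queue did not behave this way: after 80-160 reports it stopped entirely.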
Comment 1 Dan Andreescu 2014-07-30 02:51:00 UTC
I think I found the problem.  Celery chains are not documented very well.  Apparently, if one of their children raises an error, the chain can stop:

https://github.com/celery/celery/issues/1662

This is not mentioned in the main docs for chain:

http://celery.readthedocs.org/en/latest/userguide/canvas.html

I'm trying out a patch now but I don't have tests for it yet.
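The failure mode from the linked issue can be simulated in plain Python, without celery installed. A chain feeds each task's result to the next, so once a child raises, everything after it is silently skipped:

```python
# Minimal simulation of celery-chain semantics: tasks run in sequence,
# each receiving the previous result; a child's exception halts the
# whole chain and the remaining tasks never execute.

def run_chain(tasks):
    result = None
    completed = []
    for task in tasks:
        try:
            result = task(result)
        except Exception:
            break  # remaining tasks are silently skipped
        completed.append(task.__name__)
    return completed

def ok_a(_): return "a"
def boom(_): raise RuntimeError("labsdb unreachable")
def ok_b(_): return "b"

run_chain([ok_a, boom, ok_b])  # only ok_a completes; ok_b never runs
```

This matches the observed symptom: no error surfaces at scheduling time, yet most of the queued reports never run.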
Comment 2 Gerrit Notification Bot 2014-07-30 02:51:25 UTC
Change 150475 had a related patch set uploaded by Milimetric:
Fix report chain stopping

https://gerrit.wikimedia.org/r/150475
Comment 3 nuria 2014-08-05 07:57:08 UTC
>Celery chains are not documented very well.  
>Apparently, if one of their children raises an error, the chain can stop:
This makes total sense given that chains are made to pass results of one task to the next.
Comment 4 Kevin Leduc 2014-08-07 19:34:05 UTC
Collaborative tasking on etherpad:
http://etherpad.wikimedia.org/p/analytics-68840
Comment 5 Gerrit Notification Bot 2014-08-13 16:40:08 UTC
Change 150475 merged by Milimetric:
Removing usage of celery chains from report scheduling

https://gerrit.wikimedia.org/r/150475
Comment 6 nuria 2014-08-13 17:29:06 UTC
We removed chains to simplify the code and make it easier to test; the biggest performance gain, however, comes from the migration of the labsdb hosts to MariaDB.

https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals#Analytics
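The idea behind the merged change can be sketched in plain Python (the actual change is in Gerrit change 150475): schedule report runs independently rather than chaining them, so one failure cannot stop the rest. Names here are illustrative, not the real Wikimetrics code:

```python
# Run each task on its own, the way a group of independent celery
# signatures would, instead of as a chain; collect results and
# failures separately so one bad report cannot stall the queue.

def run_independently(tasks):
    results, failures = [], []
    for task in tasks:
        try:
            results.append(task())
        except Exception as e:
            failures.append((task.__name__, e))
    return results, failures

def ok(): return "done"
def bad(): raise RuntimeError("mysql unresponsive")

results, failures = run_independently([ok, bad, ok])
# all healthy reports still complete despite the one failure
```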
