Last modified: 2013-07-03 02:35:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T45287, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 43287 - Indicate current JobQueue delay by exposing oldest job_timestamp through API
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: API
Version: 1.21.x
Hardware: All, OS: All
Importance: Low enhancement with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2012-12-20 11:03 UTC by Richard Guk
Modified: 2013-07-03 02:35 UTC
CC list: 9 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Richard Guk 2012-12-20 11:03:37 UTC
At present, the number of jobs in the job queue is exposed through the API, but the delay caused by queuing is not. Yet the age of the oldest job would often be more helpful to know (at least for editors, who can be concerned or confused when categories remain unchanged for some time after pages are edited).

The API, through ApiQuerySiteinfo::appendStatistics() and SiteStats::jobs(), already exposes the estimated number of queued jobs:

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics
-> <api><query><statistics ... jobs="918518" /></query></api>
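For anyone consuming this statistic programmatically, a minimal sketch of building the query above and reading the jobs attribute from the XML reply. It does not touch the network: the sample string is the abbreviated response quoted in this report (the real one carries more attributes), and the helper names are illustrative only.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"

def statistics_url(endpoint: str = API_ENDPOINT) -> str:
    """Build the siteinfo statistics query quoted in this report."""
    params = {"action": "query", "meta": "siteinfo", "siprop": "statistics"}
    return endpoint + "?" + urlencode(params)

def job_count(xml_text: str) -> int:
    """Extract the estimated job count from an XML siteinfo response."""
    root = ET.fromstring(xml_text)                # the <api> element
    stats = root.find("./query/statistics")
    if stats is None or "jobs" not in stats.attrib:
        raise ValueError("response has no statistics/@jobs attribute")
    return int(stats.attrib["jobs"])

# Abbreviated response from the report; other attributes are omitted.
SAMPLE = '<api><query><statistics jobs="918518" /></query></api>'

print(statistics_url())
print(job_count(SAMPLE))
```

Note that this value is an estimate, not an exact count, so treat it as an order of magnitude.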

Since MediaWiki version 1.19, the job table has included a job_timestamp field. The field is already indexed. Therefore exposing MIN(job.job_timestamp) as an additional API output should be easy and efficient.

An alternative or additional new statistic would be the queue duration, i.e.:
 time() - MIN(job.job_timestamp)
This relative measure would be more suitable for graphing, especially if the site statistics are aggressively cached (since it would typically be more stable than the absolute timestamp during a caching interval).
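A minimal sketch of the proposed lookup. It uses an in-memory SQLite database as a stand-in for the MySQL job table. The three-column schema, the index name and the sample rows are simplifying assumptions. job_timestamp is assumed to hold MediaWiki's usual 14-digit YYYYMMDDHHMMSS string, which sorts chronologically, so MIN() returns the oldest job.

```python
import sqlite3
from datetime import datetime, timezone

def parse_mw_timestamp(ts: str) -> datetime:
    """Parse a 14-digit MediaWiki-style timestamp (YYYYMMDDHHMMSS, UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def oldest_job_timestamp(conn: sqlite3.Connection):
    """Return MIN(job_timestamp), or None when the queue is empty."""
    (ts,) = conn.execute("SELECT MIN(job_timestamp) FROM job").fetchone()
    return ts

def queue_duration_seconds(conn: sqlite3.Connection, now: datetime):
    """time() - MIN(job.job_timestamp) in whole seconds, or None if empty."""
    ts = oldest_job_timestamp(conn)
    if ts is None:
        return None
    return int((now - parse_mw_timestamp(ts)).total_seconds())

def make_queue() -> sqlite3.Connection:
    """Hypothetical, simplified stand-in for the job table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job (job_id INTEGER PRIMARY KEY, "
        "job_cmd TEXT, job_timestamp TEXT)"
    )
    # Indexed, as the report notes, so MIN() is a cheap lookup.
    conn.execute("CREATE INDEX job_timestamp_idx ON job (job_timestamp)")
    return conn

conn = make_queue()
conn.executemany(
    "INSERT INTO job (job_cmd, job_timestamp) VALUES (?, ?)",
    [("refreshLinks", "20121219105959"), ("htmlCacheUpdate", "20121220090000")],
)
NOW = datetime(2012, 12, 20, 11, 0, 11, tzinfo=timezone.utc)
print(oldest_job_timestamp(conn))          # 20121219105959
print(queue_duration_seconds(conn, NOW))   # 86412
```

The relative figure (86412 seconds here) is what would be graphed. The absolute timestamp changes only when the oldest job is run.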

The API could then return something like:

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics
-> <api><query><statistics ... jobs="918518" joboldesttime="2012-12-19T10:59:59Z" joboldestseconds="86412"  /></query></api>
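A minimal sketch of deriving the two proposed attributes from the oldest job's timestamp. joboldesttime and joboldestseconds are only this report's proposal, not existing API output. The 14-digit input format is an assumption, and so is the choice to omit both attributes when the queue is empty.

```python
from datetime import datetime, timezone

def proposed_job_attributes(oldest_mw_ts, now: datetime) -> dict:
    """Return the proposed joboldesttime/joboldestseconds attributes.

    Both names come from the proposal in this report; they are not an
    existing MediaWiki API output. An empty queue yields no attributes.
    """
    if oldest_mw_ts is None:
        return {}
    oldest = datetime.strptime(oldest_mw_ts, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return {
        "joboldesttime": oldest.strftime("%Y-%m-%dT%H:%M:%SZ"),  # ISO 8601, UTC
        "joboldestseconds": str(int((now - oldest).total_seconds())),
    }

now = datetime(2012, 12, 20, 11, 0, 11, tzinfo=timezone.utc)
attrs = proposed_job_attributes("20121219105959", now)
print(" ".join(f'{k}="{v}"' for k, v in attrs.items()))
# joboldesttime="2012-12-19T10:59:59Z" joboldestseconds="86412"
```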

(Incidentally, the queue duration might be a useful or at least interesting additional metric for Ganglia, since it would help to distinguish a pathological backlog from high throughput.)
Comment 1 Sam Reed (reedy) 2012-12-20 11:15:44 UTC
Just exposing the oldest job_timestamp isn't much use. You could have the oldest jobs that haven't been run/picked for some reason, but then most of the queue is of much newer time...
Comment 2 Richard Guk 2012-12-20 11:26:51 UTC
(In reply to comment #1)
> Just exposing the oldest job_timestamp isn't much use. You could have the
> oldest jobs that haven't been run/picked for some reason, but then most of
> the queue is of much newer time...

I have assumed that jobs are run either FIFO or (in effect) randomly; so that, either way, the timestamp of the oldest job would be a meaningful indicator of the maximum expected delay before the link table is updated after a recent edit.

Is there a non-pathological reason why some jobs would be left in the queue unrun/unpicked for an exceptional length of time?
Comment 3 Max Semenik 2012-12-20 11:32:51 UTC
Jobs are pulled from queue randomly these days, so this metric may be less meaningful.
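To show why the picking order matters to this metric, here is a toy simulation comparing FIFO with uniformly random picking. It is an assumed model, not MediaWiki's job runner. A backlog is queued at tick 0, then one job arrives and one job runs per tick. The result counts both the waits of jobs that ran and the ages of jobs still queued at the end.

```python
import random
from typing import List

def fifo(queue: List[int], rng: random.Random) -> int:
    """Run the oldest queued job first."""
    return queue.pop(0)

def random_pick(queue: List[int], rng: random.Random) -> int:
    """Run a uniformly random queued job."""
    return queue.pop(rng.randrange(len(queue)))

def worst_delay(pick, backlog: int = 50, ticks: int = 500, seed: int = 0) -> int:
    """Longest delay, in ticks, seen in the toy queue described above."""
    rng = random.Random(seed)
    queue = [0] * backlog            # enqueue tick of each waiting job
    worst = 0
    for tick in range(1, ticks + 1):
        queue.append(tick)           # one new job arrives
        enqueued = pick(queue, rng)  # the runner takes one job
        worst = max(worst, tick - enqueued)
    # Jobs still waiting at the end contribute their current age.
    return max([worst] + [ticks - e for e in queue])

print("FIFO:  ", worst_delay(fifo))
print("random:", worst_delay(random_pick))
```

Under FIFO, the worst delay equals the backlog, so the oldest job's age bounds every other job's wait. Under random picking, individual jobs can linger much longer, so the oldest timestamp describes the unlucky tail rather than the typical delay.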
Comment 4 Richard Guk 2012-12-20 12:16:27 UTC
(In reply to comment #3)
> Jobs are pulled from queue randomly these days, so this metric may be less
> meaningful.

Even in the worst such case, a median time could be reported to at least indicate the "typical" delay.
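A minimal sketch of the difference between the oldest age and the median age. The queue contents are made up: one stuck job plus a recent backlog, with timestamps in the assumed 14-digit format. The computation runs over a small in-memory list rather than in SQL.

```python
from datetime import datetime, timezone
from statistics import median

def queue_ages(mw_timestamps, now: datetime):
    """Ages in seconds of queued jobs, given 14-digit timestamps (UTC)."""
    return [
        (now - datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc))
        .total_seconds()
        for ts in mw_timestamps
    ]

now = datetime(2012, 12, 20, 12, 0, 0, tzinfo=timezone.utc)
# Hypothetical queue: one job stuck since 1 December, four recent ones.
queued = [
    "20121201000000",
    "20121220113000",
    "20121220114500",
    "20121220115000",
    "20121220115500",
]
ages = queue_ages(queued, now)
print("oldest:", max(ages))    # oldest: 1684800.0  (19.5 days)
print("median:", median(ages)) # median: 900.0      (15 minutes)
```

A single stuck job dominates the oldest age but barely moves the median, which is why the two figures answer different questions.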

But frankly, if old jobs remain stuck in the queue for so long that MIN(job_timestamp) becomes effectively meaningless, then the picking method itself is dubious - which is a bad reason for not exposing the metric!

Though the queue age might seem unrelated to operational concerns (since the JobQueue is a documented and controlled breach of database consistency), the duration is relevant to the general issue of database integrity (because, like cached pages, links should not be out of date indefinitely).

More practically, a disclosed statistic would reassure editors, who frequently and understandably ask: "Will my template edit or category change propagate, even though I saved the edit ages ago and nothing has happened?"

If editors knew that pages would be updated within a reasonable and foreseeable length of time, they would have less resort to the current practice of purging and saving null edits to bypass the inscrutable job queue, at the expense of greater load on the servers as well as on the editors concerned!
Comment 5 Sam Reed (reedy) 2012-12-20 12:24:08 UTC
Hopefully, all the work Aaron has put into overhauling the job queue code should mean it's much better from now on.

But it is somewhat arbitrary. The job queue count was removed from user view (Special:Statistics? I think) because it was misleading. It was left in the API for developers etc.
Comment 6 Richard Guk 2012-12-20 12:55:32 UTC
(In reply to comment #5)
> Hopefully, all the work Aaron has put into overhauling the job queue code
> should mean it's much better from now on.

Even more reason to have a way to measure the result of Aaron's work! Not knowing the detail, I can't think why FIFO would not be the preferred picking algorithm, given that the timestamp is indexed; but whatever the method, you'd surely not want any jobs to remain unprocessed for days, weeks or months at a time?

> But it is somewhat arbitrary. The job queue count was removed from user view
> (Special:Statistics? I think) because it was misleading. It was left in the
> API for developers etc.

I agree that the size of the queue was not directly relevant to editors. But turning your comment on its head, I would say that editors deserve to know about likely propagation delays (per comment #4), and that the obscurity of the API (from most editors' perspective) means that it would have been far better if [[Special:Statistics]] had added this additional information instead of removing the little information that it used to expose about the queue.

Readers expect modern websites to be more-or-less up-to-date. Logged-in editors are used to seeing wiki pages that are current in all other respects. So editors, admins, tech ops and servers all benefit from avoiding enquiries and manual purging by making durations easily identifiable, distinguishing routine high I/O from exceptional delays.
Comment 7 Max Semenik 2012-12-20 14:31:03 UTC
(In reply to comment #6)
> I agree that the size of the queue was not directly relevant to editors. But
> turning your comment on its head, I would say that editors deserve to know
> about likely propagation delays (per comment #4), and that the obscurity of
> the API (from most editors' perspective) means that it would have been far
> better if [[Special:Statistics]] had added this additional information instead
> of removing the little information that it used to expose about the queue.

One of the reasons this number was considered misleading was that, for performance reasons, the queue size is estimated, not counted. And these estimations can be wildly inaccurate. This will remain an issue at least as long as the queue is kept in MySQL.
Comment 8 Richard Guk 2012-12-20 15:09:59 UTC
(In reply to comment #7)
> One of the reasons this number was considered misleading was that, for
> performance reasons, the queue size is estimated, not counted. And these
> estimations can be wildly inaccurate. This will remain an issue at least as
> long as the queue is kept in MySQL.

Understood (per bug 27584). I only mentioned the Special:Statistics removal in response to reedy's aside. The API jobs value seems to oscillate disconcertingly between 3 disparate values at any one time (presumably because of aggressive caching), but that oddity is outside the scope of this request.

Focusing again on reporting the queue duration:

The proposed information would be easy to calculate accurately, as well as being more useful to know than the queue size. Even if caching were needed (doubtful), the value would only need refreshing each time the oldest job is run. But the minimum value of an indexed field is already well optimised for a database lookup.
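If caching were used, the refresh rule described here could look like the following minimal sketch. OldestJobCache is a hypothetical class, and fetch_min stands in for the indexed MIN(job_timestamp) query. The sketch assumes on_run is called after the job's row has been deleted. New jobs can only lower the minimum, so only running a job with the cached timestamp forces a re-query.

```python
from typing import Callable, Optional

class OldestJobCache:
    """Hypothetical cache of MIN(job_timestamp).

    Refreshes only when a job carrying the cached (oldest) timestamp runs.
    """

    def __init__(self, fetch_min: Callable[[], Optional[str]]):
        self._fetch_min = fetch_min   # stand-in for the indexed MIN() query
        self._oldest = fetch_min()
        self.refreshes = 0

    def on_insert(self, ts: str) -> None:
        # A new job can only make the minimum smaller (or leave it alone).
        if self._oldest is None or ts < self._oldest:
            self._oldest = ts

    def on_run(self, ts: str) -> None:
        # Call after the job row is deleted; re-query only if it was oldest.
        if ts == self._oldest:
            self._oldest = self._fetch_min()
            self.refreshes += 1

    @property
    def oldest(self) -> Optional[str]:
        return self._oldest

# Usage with a plain list standing in for the job table.
jobs = ["20121220100000", "20121220110000", "20121220120000"]
cache = OldestJobCache(lambda: min(jobs, default=None))
jobs.remove("20121220110000"); cache.on_run("20121220110000")  # no refresh
jobs.remove("20121220100000"); cache.on_run("20121220100000")  # refresh
print(cache.oldest, cache.refreshes)  # 20121220120000 1
```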

When this enhancement was previously requested (bug 13786, which I have only just found), it was rejected because the job_timestamp column did not exist at the time. So the main hurdle has already been overcome.
Comment 9 MZMcBride 2012-12-31 23:17:06 UTC
This bug seems like a duplicate of bug 9518 or bug 13786.
