Last modified: 2013-11-06 01:52:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T11518, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 9518 - Job queue estimate often woefully inaccurate; need a better strategy
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: JobQueue
Version: unspecified
Hardware: Other All
Importance: Low enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 9520 9521
Reported: 2007-04-07 09:26 UTC by Purodha Blissenbach
Modified: 2013-11-06 01:52 UTC

Description Purodha Blissenbach 2007-04-07 09:26:17 UTC
There seems to be something filling the job queue in a loop, 
we see in pretty quick succession:
 
 5723 
 5726 
    3  (sometimes)
    2 
 5723  
... etc.

If this is caused by recursive behaviour in template 
calls, it was likely introduced - or stirred into the job 
queue - by changes made in recent days, most likely on 
or after April 4th, when I made changes inside a system 
of interdependent templates which I mostly did not write or 
design, and at best partially understand. I tried to simplify 
them by using parser functions instead of the highly 
complicated condition-evaluating templates of the 
pre-parser-function era.
I remember that I inspected the job queue after one such 
change and saw it go down to zero, but frankly cannot recall when.
Comment 1 Rob Church 2007-04-07 09:30:26 UTC
The job queue is a queue of link updates waiting to be processed. We have an
external script running on a continuous cycle to perform these updates. Don't
worry if the queue fills up; it will be processed in turn.

A high job queue count after editing a heavily-used template is normal. That the
job queue might not fall to zero on well-established projects with frequent
edits is also normal.
Comment 2 Rob Church 2007-04-07 10:56:27 UTC
*** Bug 9520 has been marked as a duplicate of this bug. ***
Comment 3 Rob Church 2007-04-07 10:56:33 UTC
*** Bug 9521 has been marked as a duplicate of this bug. ***
Comment 4 Purodha Blissenbach 2007-04-07 12:02:04 UTC
There have been zero edits for more than 2 hours according 
to both [[Special:Recentchanges]] and the rc IRC channel.

The job queue count appears to be cyclic with a 
repetition period of considerably less than a minute.

I have never seen a nonzero job queue count for 
extended periods of time on this project, which I 
monitor closely, i.e. more often than daily on 
average.

I suggest that someone with server access should have a 
look at the job queue and report which pages are 
inserted and/or always kept. I am pretty certain there 
is some loop there. Knowing the pages involved, I might 
be able to make some guesses as to its origin.

If there is indeed an endless loop feeding the job 
queue, precautions should be taken in the wiki software 
so as not to waste server resources.
Comment 5 Rob Church 2007-04-07 12:08:58 UTC
The job runner works on a cycle across all wikis. It may take some time to reach
yours.
Comment 6 Purodha Blissenbach 2007-04-07 12:50:37 UTC
I know that it usually takes much more time until a job 
queue length change is shown in [[Special:Statistics]] 
for this wiki.

Please point a browser to 
http://ksh.wikipedia.org/wiki/Spezial:Statistik and make 
it reload the page some 60 times in succession over ~2 minutes.

If you then believe what you see is healthy, I shall 
give up reopening this bug.

Keep in mind that this has been going on for at least 
two and a half hours, and there have been no edits at 
all, including no bot edits, yet the job queue size has 
been changing quickly and repetitively all the time.

Comment 7 Domas Mituzas 2007-04-07 12:55:49 UTC
Is this a long-term issue? 
Since this morning the job queue count is approximate, and different servers may return different approximations.

If a server has a long-open transaction, purged rows will still be included in the estimate, so it shows a big number even though there are fewer actual job entries. 
Comment 8 Purodha Blissenbach 2007-04-07 13:12:40 UTC
This is a good hint at least explaining the grossly 
altered "general behaviour" since this morning.

We see rapid changes in the job queue length displayed, 
about a dozen or more cycles per minute - while 
experience teaches that there should be no job queue.

If the approximation may lead to a nonzero figure when 
the queue is/was empty for a long time, then, yes, that 
would possibly explain what one sees. From a user's 
point of view, it may be bewildering, of course.

If a constant zero queue length (at least after some 
time less than 2 hours) should also be reflected on 
[[Special:Statistics]] as zero, then we still have no 
obvious explanation for the figures shown.
Comment 9 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-04-08 00:55:34 UTC
After the one-sentence explanation of how it works that _syphilis_ gave me on IRC, I would guess 
that it should show 0 if the length is 0.  If it were more than 0, though, the number would probably 
fluctuate randomly every time you refresh the page, which may be what you're seeing.

Remember, the job queue is shared, and the statistic given on ksh-wiki's page reflects job queue 
items created by many other wikis, so it's not surprising that it should be nonzero all the time.

Sounds normal to me, but I'll let a server admin decide whether to close.
Comment 10 Purodha Blissenbach 2007-04-08 12:41:30 UTC
Ok, fluctuation explained.

A nonzero count after hours of no edit activity is at 
least extremely unlikely, and there should be no cause 
for it if everything were as it should be.

The quick fluctuation, observed following a predictable 
scheme for more than 26 hours now (assuming it did not 
change back and forth while no one was watching), is 
either an error in the wiki - e.g. circular definitions 
in templates (which imho would not explain the speed of 
change) - or probably something pretty strange in the 
display of the figures themselves.

I would like the actual job queue entries on the server 
inspected so as to get better clues.

I did inspect and follow the job queue figures on some 
other Wikipedia wikis, observing similar patterns at:

be: 146 - 257 - 366 - 967
qu: 5 - 5 - 5 - 1
es: 286 - 24 - 24 - 286

but not everywhere. Of course, I only followed them for 
a few minutes. Yet:

ksh: 5723 - 5726 - 3 - 2 

has been happening for 26 hours, pretty much unaltered.
Comment 11 Daniel Beyer 2007-08-11 11:15:12 UTC
The job queue on my test wiki is never empty either. After a fresh installation of MediaWiki the job queue shows "1". After editing a template the queue was at 2500. With runJobs.php I can reduce this to 109 but no further. Now there are always at least 109 jobs in the queue, even after running runJobs.php.

This error only appears on MediaWiki 1.10. On 1.9 the job count works.
Comment 12 Purodha Blissenbach 2007-08-14 11:19:09 UTC
In bug #10417 comment #4, Brion Vibber writes something that seems contradictory to several explanatory statements given above. Maybe this only reflects that I have not understood enough.

At least the job queue count on kshwiki has been showing seemingly random figures for months, and they do not say anything. There is no longer any way to tell when one's template changes have been propagated, even if there is only one editor and one edit for extended periods of time, as is evident from recent changes.
Comment 13 Rob Church 2007-08-14 12:42:51 UTC
The job queue is a processing queue for operations which can be deferred. This queue is stored in the database and can be processed via a couple of methods.

The default method is to execute one job per page view provided that a job can be obtained without causing deadlocks on the job table. This execution rate can be tweaked in the site configuration. We do not use this method on Wikimedia wikis.
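
To illustrate the default method described above, here is a minimal Python sketch of running jobs on page views at a configurable rate. It is not MediaWiki's code: `job_run_rate`, `jobs_to_run` and `handle_page_view` are hypothetical names, the queue is a plain in-memory deque rather than the job table, and the deadlock avoidance mentioned above is not modelled. The treatment of fractional rates (run a job only on some page views) is an assumption about how such a setting could behave.

```python
import random
from collections import deque


def jobs_to_run(job_run_rate, rng=random.random):
    """Return how many jobs a single page view should attempt.

    The whole part of the rate is always run; the fractional part is
    treated as the probability of running one extra job.
    """
    whole = int(job_run_rate)
    frac = job_run_rate - whole
    return whole + (1 if rng() < frac else 0)


def handle_page_view(queue, job_run_rate=1.0, rng=random.random):
    """Run up to the configured number of jobs; return how many ran."""
    ran = 0
    for _ in range(jobs_to_run(job_run_rate, rng)):
        if not queue:
            break  # nothing left to do
        job = queue.popleft()
        job()
        ran += 1
    return ran


if __name__ == "__main__":
    queue = deque([lambda: None] * 3)
    print(handle_page_view(queue), "job run,", len(queue), "left")
```

With a rate of 1.0 each page view removes at most one job, so on a wiki with few visitors a large queue drains slowly.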

The second method is to execute a maintenance script which processes all jobs in the queue at once. On Wikimedia wikis, we use this method in a specific manner, that is, we have multiple application servers which are allocated a pool of wikis, and these run a script which loops through each wiki in turn, completes the job queue for that wiki, and moves on. I suspect a previous incomplete explanation of this has caused the confusion.
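
The cycle described above can be sketched roughly as follows. This is an illustration only, not the actual Wikimedia runner script: `run_cycle` is a hypothetical name, each wiki's queue is modelled as an in-memory deque, and the wiki names are just examples.

```python
from collections import deque


def run_cycle(queues):
    """Make one pass over a pool of wikis, draining each queue in turn.

    `queues` maps a wiki name to a deque of callables (the jobs).
    Returns the number of jobs processed per wiki.
    """
    processed = {}
    for wiki, queue in queues.items():
        count = 0
        while queue:  # complete this wiki's queue before moving on
            job = queue.popleft()
            job()
            count += 1
        processed[wiki] = count
    return processed


if __name__ == "__main__":
    pool = {
        "kshwiki": deque([lambda: None] * 3),
        "eswiki": deque([lambda: None] * 2),
    }
    print(run_cycle(pool))
```

A wiki late in the pool waits until all earlier queues are drained, which is why a queue can stay nonzero for a while without anything being wrong.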

The job queue count shown on Special:Statistics is generated using a clever trick which avoids an expensive COUNT(*) on the job table. This trick means that the value will fluctuate, often to the point where it's downright incorrect. Our replicated environment also doesn't help much in this regard.

There have been previous discussions regarding this count. There are those in favour of removing it, since it provides a means for some clueless users to spread FUD. There are suggestions that we could generate an accurate count and cache this for a period of time, although this has drawbacks too, since the cached data may be wholly inaccurate. There are further suggestions to graph the data in some manner, providing a more useful visual overview of what's going on.
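
The "accurate count, cached for a period of time" suggestion could look something like the sketch below. `CachedJobCount`, `count_fn` and `ttl_seconds` are hypothetical names; `count_fn` stands in for whatever performs the expensive exact count.

```python
import time


class CachedJobCount:
    """Cache an expensive exact count for `ttl_seconds`."""

    def __init__(self, count_fn, ttl_seconds, clock=time.monotonic):
        self._count_fn = count_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None  # None means nothing cached yet

    def get(self):
        """Return the cached count, refreshing it once it has expired."""
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._ttl
        )
        if expired:
            self._value = self._count_fn()
            self._fetched_at = now
        return self._value


if __name__ == "__main__":
    cache = CachedJobCount(lambda: 5723, ttl_seconds=300)
    print(cache.get())
```

The drawback mentioned above is visible here: until the TTL expires, `get()` keeps returning the old value even if the queue has since been emptied.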

Due to the nature of a large number of Wikimedia wikis (that is, that they are often prone to heavy editing), and due to the uses we make of the job queue, any given wiki may, for a short (for some definition of "short") time, have a large job queue. ** This is not a problem, it will be dealt with when the cycle returns to process that particular wiki. **

Third parties who are still worried about their job queue sizes are encouraged to:

* run maintenance/showJobs.php or execute a SELECT COUNT(*) FROM `job` against their database, to determine what's left on the queue
* run maintenance/runJobs.php on a periodic basis (or set up a cron job) to ensure that larger queues are processed
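
For illustration, the exact count from the first bullet can be tried out with the Python sketch below. It uses an in-memory SQLite database as a stand-in for the wiki's real database, and the two-column `job` table is a simplified assumption, not the actual MediaWiki schema. `exact_job_count` is a hypothetical helper name.

```python
import sqlite3


def exact_job_count(conn):
    """Return the exact number of rows in the job table."""
    row = conn.execute("SELECT COUNT(*) FROM job").fetchone()
    return row[0]


# Stand-in database: a simplified job table with three queued jobs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT)")
conn.executemany(
    "INSERT INTO job (job_cmd) VALUES (?)",
    [("refreshLinks",)] * 3,
)

if __name__ == "__main__":
    print(exact_job_count(conn))
```

On a small third-party wiki the exact count is cheap; the cost only becomes a concern on very large job tables.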

There is no actual bug here, just a common misperception about what is being shown and what is happening behind the scenes.
Comment 14 Brion Vibber 2007-08-14 17:49:39 UTC
Reopening. The clever index trick shows wholly useless incorrect numbers in many cases, so I would indeed recommend improvements.
Comment 15 Rob Church 2007-08-14 18:14:56 UTC
Repurposing to focus on this.

Just to ensure the suggestion is documented; we could run the estimate, and then run the actual COUNT(*) if the estimate returns a number of rows below some threshold.
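
A minimal sketch of that suggestion in Python, assuming the estimate and the exact count are available as two callables. `job_queue_size`, `estimate_fn`, `exact_fn` and the default threshold are all hypothetical, chosen only for illustration.

```python
def job_queue_size(estimate_fn, exact_fn, threshold=10000):
    """Return (count, is_exact).

    Run the cheap estimate first. Only when it reports fewer rows than
    `threshold` is the expensive exact count run as well.
    """
    estimate = estimate_fn()
    if estimate < threshold:
        return exact_fn(), True
    return estimate, False


if __name__ == "__main__":
    # The estimate claims 5723 rows although the queue is really empty.
    print(job_queue_size(lambda: 5723, lambda: 0))
```

The threshold has to sit above the kind of bogus estimate reported earlier in this bug (about 5700 for an apparently empty queue), otherwise the exact count is never reached in exactly the case that confuses people.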
Comment 16 Lars Aronsson 2008-03-31 11:29:09 UTC
Currently, [[Special:Statistics]] only shows the length of the job queue. I don't mind 1000 jobs in queue if they are executed in a millisecond each, but if they take two seconds each I might as well take a lunch break.

I suggest the age of the job at the head of the queue should also be reported. If the job to be executed next is one hour old, I can expect my newly added job to be executed within roughly one hour. This is not an expensive computation, just the retrieval of a single attribute from a single object.
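
As a rough illustration of this idea, the Python sketch below computes the age of the oldest queued job. It assumes each job records an insertion timestamp, which nothing in this discussion confirms the job table actually stores. `head_of_queue_age` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def head_of_queue_age(insert_times, now=None):
    """Return the age of the oldest queued job, or None if none queued.

    `insert_times` is an iterable of timezone-aware datetimes, one per
    queued job.
    """
    insert_times = list(insert_times)
    if not insert_times:
        return None
    now = now or datetime.now(timezone.utc)
    return now - min(insert_times)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    queued = [now - timedelta(hours=1), now - timedelta(minutes=5)]
    print(head_of_queue_age(queued, now))
```

In a database this would be a single indexed lookup of the smallest timestamp rather than a scan over all rows.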

Comment 17 MZMcBride 2009-02-17 05:42:53 UTC
One possible option: split the line out from Special:Statistics and make a Special:JobQueue page. The page would show the total number of jobs and the bottom ten jobs in a table, and would have a mechanism to run the jobs for wikis that do it manually (so that site admins do not have to do it from the command line).
Comment 18 Niklas Laxström 2009-05-23 08:17:35 UTC
Why is this number shown at all? Under what circumstances would it be useful to know how many jobs there are? Would a simple "there are jobs" / "there are no jobs" be sufficient?
Comment 19 Gurch 2009-05-23 17:57:35 UTC
(In reply to comment #18)
> Would a simple "there are jobs" / "there are no jobs"
> be sufficient?

On a large wiki there are always jobs anyway. But yes, the number is meaningless to users and there's no real reason to show it.
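
For what it is worth, a yes/no check of this kind does not need a count at all. The Python sketch below uses an in-memory SQLite database as a stand-in, with a simplified `job` table that is an assumption rather than the real schema. `has_jobs` is a hypothetical helper name.

```python
import sqlite3


def has_jobs(conn):
    """Return True if at least one row exists in the job table."""
    row = conn.execute("SELECT 1 FROM job LIMIT 1").fetchone()
    return row is not None


# Stand-in database with an initially empty, simplified job table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT)")

if __name__ == "__main__":
    print("there are jobs" if has_jobs(conn) else "there are no jobs")
```

The query stops at the first row it finds, so it stays cheap however long the queue is.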
Comment 20 Mike.lifeguard 2009-08-21 17:37:34 UTC
Domas had some idea about this the other day.
Comment 21 Niklas Laxström 2010-04-15 12:50:19 UTC
I removed them from the user interface in r65059, since they were just confusing users. The number is still accessible, so the issue is not solved. It's just lower priority now.
Comment 22 Aaron Schulz 2013-06-25 21:28:19 UTC
(In reply to comment #21)
> I removed them from the user interface in r65059, since they were just
> confusing users. The number is still accessible, so the issue is not solved.
> It's just lower priority now.

That number will be useless for WMF sites, since the queue is not in that table.
Comment 23 Gerrit Notification Bot 2013-07-04 07:26:21 UTC
Change 71966 had a related patch set uploaded by Aaron Schulz:
jobqueue: improved performance of JobQueueGroup::getQueuesWithJobs()

https://gerrit.wikimedia.org/r/71966
Comment 24 Gerrit Notification Bot 2013-10-01 03:23:52 UTC
Change 71966 merged by jenkins-bot:
jobqueue: improved performance of JobQueueGroup::getQueuesWithJobs()

https://gerrit.wikimedia.org/r/71966
