Last modified: 2013-06-02 18:37:03 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T44614, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 42614 - JobQueueDB::claimRandom very slow with lots of jobs, takes half of the total time for queries
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
JobQueue (Other open bugs)
1.21.x
All All
: Highest critical with 1 vote
: 1.20.x release
Assigned To: Aaron Schulz
: code-update-regression, performance
Depends on:
Blocks:
 
Reported: 2012-12-01 22:08 UTC by Niklas Laxström
Modified: 2013-06-02 18:37 UTC
13 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---
federicoleva: Backport_to_Stable-


Attachments

Description Niklas Laxström 2012-12-01 22:08:36 UTC
mysql> EXPLAIN SELECT /* JobQueueDB::claimRandom 127.0.0.1 */ * FROM `job` WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = '' AND (job_random <= '2108540693') ORDER BY job_random DESC LIMIT 1;
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
| id | select_type | table | type | possible_keys                          | key           | key_len | ref         | rows  | Extra       |
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | job   | ref  | job_cmd,job_cmd_token,job_cmd_token_id | job_cmd_token | 96      | const,const | 43440 | Using where |
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
1 row in set (0.00 sec)

mysql> SELECT /* JobQueueDB::claimRandom 127.0.0.1 */ * FROM `job` WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = '' AND (job_random <= '2108540693') ORDER BY job_random DESC LIMIT 1;
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
| job_id | job_cmd                      | job_namespace | job_title                                              | job_timestamp  | job_params | job_random | job_token | job_token_timestamp | job_sha1                        | job_attempts |
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
| 447774 | MessageGroupStatesUpdaterJob |          1202 | Konsolekalendar-e2ca93-Add_location_to_incidence_(f/en | 20121201204852 | a:0:{}     | 2108426131 |           | NULL                | f78vrjeo5e8f17ackthos9uhjzw686y |            0 |
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
1 row in set (3.62 sec)


It's going to take ages to clear all the jobs if picking each one takes 3 seconds.

php maintenance/showJobs.php 
86324
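[Editor's note] The EXPLAIN above suggests why this is slow: the chosen index (`job_cmd_token`) covers only (job_cmd, job_token), not job_random, so the range condition plus ORDER BY forces MySQL to examine every one of the ~43k prefix-matching rows. A minimal Python sketch of that access pattern (an illustration only, not MediaWiki code; row counts and names mirror the output above):

```python
import random

# Hypothetical in-memory model of the prefix-matching rows of the job
# table, illustrating (not reproducing) the two access patterns.
rows = [{"job_id": i, "job_random": random.randrange(2**31)}
        for i in range(43440)]

def claim_with_order_by(rows, threshold):
    """ORDER BY job_random DESC LIMIT 1 without an index on job_random:
    every candidate row must be examined to find the maximum."""
    examined, best = 0, None
    for r in rows:  # full scan of all prefix-matching rows
        examined += 1
        if r["job_random"] <= threshold:
            if best is None or r["job_random"] > best["job_random"]:
                best = r
    return best, examined

def claim_any(rows, threshold):
    """Without ORDER BY: stop at the first row satisfying the range."""
    for examined, r in enumerate(rows, start=1):
        if r["job_random"] <= threshold:
            return r, examined
    return None, len(rows)

threshold = random.randrange(2**31)
_, n_ordered = claim_with_order_by(rows, threshold)
_, n_any = claim_any(rows, threshold)
print(n_ordered, n_any)  # n_ordered is always 43440; n_any is usually tiny
```

This matches the later discussion (comment 23) where dropping the ORDER BY was proposed as the fix.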
Comment 1 Niklas Laxström 2012-12-02 08:25:43 UTC
It has now been running overnight:
php maintenance/showJobs.php 
67285

Not good.
Comment 2 Nemo 2012-12-03 11:09:33 UTC
I don't know if it's the same bug (Niklas said "possibly"), but it seems to also affect Wikimedia projects, as reported by Seddon (I guess his translation notifications on Meta are stuck again).
Since the 29th very few new jobs have run: on en.wiki alone the queue rose from 40k to 150k items, and the load on the job runners decreased from about 100 to about 50.
https://gdash.wikimedia.org/dashboards/jobq/deploys
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=506&m=enwiki_JobQueue_length
http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Jobrunners+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

(I suppose this deserves highest priority, and now requires an assignee?)
Comment 3 Aaron Schulz 2012-12-03 20:22:57 UTC
According to https://ishmael.wikimedia.org for enwiki, that query does not stand out in terms of average (or even p95) query time. If there was a problem with it being slow, I would expect gdash to show the same queue rate but a lower run rate.
Comment 4 Nemo 2012-12-03 20:41:43 UTC
(In reply to comment #3)
> According to https://ishmael.wikimedia.org for enwiki, that query does not
> stand out in terms of average (or even p95) query time. If there was a
> problem with it being slow, I would expect gdash to show the same queue rate
> but a lower run rate.

Are you saying that the problem on Wikimedia wikis is another bug? What number?
Comment 5 Aaron Schulz 2012-12-03 22:03:47 UTC
I am getting:
	PHP Notice: Trying to get property of non-object in /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php on line 115

when trying to run jobs manually.
Comment 6 Aaron Schulz 2012-12-03 22:08:27 UTC
I added exception details to cli mode and added an exception there:

No row object provided.
Backtrace:
#0 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(59): CentralAuthUser::newFromRow(false)
#1 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(78): CentralAuthUserArrayFromResult->setCurrent(Object(stdClass))
#2 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(21): UserArrayFromResult->__construct(Object(ResultWrapper))
#3 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(10): CentralAuthUserArrayFromResult->__construct(Object(ResultWrapper))
#4 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthHooks.php(400): CentralAuthUserArray::newFromResult(Object(ResultWrapper))
#5 [internal function]: CentralAuthHooks::onUserArrayFromResult(NULL, Object(ResultWrapper))
#6 /home/wikipedia/common/php-1.21wmf5/includes/Hooks.php(253): call_user_func_array('CentralAuthHook...', Array)
#7 /home/wikipedia/common/php-1.21wmf5/includes/GlobalFunctions.php(3832): Hooks::run('UserArrayFromRe...', Array)
#8 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(30): wfRunHooks('UserArrayFromRe...', Array)
#9 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(52): UserArray::newFromResult(Object(ResultWrapper))
#10 /home/wikipedia/common/php-1.21wmf5/includes/UserMailer.php(552): UserArray::newFromIDs(Array)
#11 /home/wikipedia/common/php-1.21wmf5/includes/job/jobs/EnotifNotifyJob.php(53): EmailNotification->actuallyNotifyOnPageChange(Object(User), Object(Title), '20121203220703', '::I have to say...', 0, 526235288, Array)
#12 /home/wikipedia/common/php-1.21wmf5/maintenance/runJobs.php(83): EnotifNotifyJob->run()
#13 /home/wikipedia/common/php-1.21wmf5/maintenance/doMaintenance.php(110): RunJobs->execute()
#14 /home/wikipedia/common/php-1.21wmf5/maintenance/runJobs.php(116): require_once('/home/wikipedia...')
#15 /home/wikipedia/common/multiversion/MWScript.php(69): require_once('/home/wikipedia...')
#16 {main}
Comment 7 Tim Starling 2012-12-04 00:16:37 UTC
(In reply to comment #5)
> I am getting:
>     PHP Notice: Trying to get property of non-object in
> /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php
> on line 115
> 
> trying to run jobs manually.

https://gerrit.wikimedia.org/r/#/c/36683/
Comment 8 Marius Hoch 2012-12-04 00:36:48 UTC
Merged the CentralAuth fix... leaving this one open, as it seems to cover two issues.
Comment 9 Aaron Schulz 2012-12-04 00:40:13 UTC
In any case, the CA thing I noticed is a red herring. More important is that after tailing the job running log, I can see that only 3 of 16 boxes are active, and some processes on those three may be stuck. I think they got gummed up by the itwikisource slowness bug with LST (runners can be stuck on old code even if we fix bugs).
Comment 10 Aaron Schulz 2012-12-04 00:46:02 UTC
(In reply to comment #9)
> In any case, the CA thing I noticed is a red herring. More important is that
> after tailing the job running log, I can see that only 3 of 16 boxes are
> active, and some processes on those three may be stuck. I think they got
> gummed up by the itwikisource slowness bug with LST (runners can be stuck on
> old code even if we fix bugs).

Tim killed those processes and kicked the runners, which seems to have worked per https://gdash.wikimedia.org/dashboards/jobq/.
Comment 11 Aaron Schulz 2012-12-04 00:49:21 UTC
(In reply to comment #3)
> If there was a problem with it being
> slow, I would expect gdash to show the same queue rate but lower run rate.

For the record, ignore that: refreshlinks2 spawns refreshlinks jobs, which confounds this.
Comment 12 Niklas Laxström 2012-12-04 07:19:27 UTC
Any comments on the behavior I observe?
Comment 13 Nemo 2012-12-04 08:27:23 UTC
Now the jobrunners are as active as on average before the 29th, but the job queue is still increasing steadily, so this is apparently not enough; they're not consuming the backlog as they usually do.
Could this match the behaviour reported by Niklas?
Comment 14 Joseph Seddon 2012-12-05 10:33:38 UTC
I am wondering if this bug is being caused by the same issues we are seeing in the queue:

https://bugzilla.wikimedia.org/show_bug.cgi?id=42715

Is it possible that the queue growth is a result of jobs being duplicated?
Comment 15 Niklas Laxström 2012-12-05 10:40:08 UTC
I'm pretty sure nobody can make heads or tails of this bug anymore. I reported a very specific case where I observe a very slow job run rate because job selection is slow due to table scanning.

This issue might contribute to the general problems we are currently seeing with the job queue, but all the other issues discussed in this bug are different from this particular one.
Comment 16 Jarry1250 2012-12-05 11:24:02 UTC
Niklas: We could test the issue you report by temporarily making the queue non-random? Why does it even need to be random in the first place?
Comment 17 Nemo 2012-12-07 12:12:52 UTC
(In reply to comment #1)
> It has now been running overnight:
> php maintenance/showJobs.php 
> 67285

It's still at 67k after 5 more days, by the way.
Comment 18 Richard Guk 2012-12-09 16:08:01 UTC
Is this related to [[mw:Extension:GeoData]] affecting the 819,000 enwiki articles transcluding {{Coord}}, or is the timing coincidental?

The new {{#coordinates:}} parser function was added to the template at 2012-12-06 07:30:18 UTC:
http://en.wikipedia.org/w/index.php?title=Template:Coord&diff=526665541&oldid=368854805
Comment 19 Platonides 2012-12-09 16:13:12 UTC
Those queries _should_ be fast, just an index traversal. :/
Comment 20 Richard Guk 2012-12-09 16:21:51 UTC
(In reply to comment #19)
> Those queries _should_ be fast, just an index traversal. :/

But would still cause a one-off reparse of the affected pages to be sent to the job queue?
Comment 21 Nemo 2012-12-09 16:22:53 UTC
(In reply to comment #18)
> Is this related to [[mw:Extension:GeoData]] affecting the 819,000 enwiki
> articles transcluding {{Coord}}, or is the timing coincidental?

There isn't any coincidence in timing; the jobqueue graph has the same gradient for a whole day after that edit.
Comment 22 Antoine "hashar" Musso (WMF) 2012-12-11 09:28:23 UTC
db39 in production showed JobQueueDB::claimRandom using 21% of query time:

 /* JobQueueDB::claimRandom */ SELECT * FROM `job` WHERE job_cmd = ? AND job_token = ? AND (job_random <= ?) ORDER BY job_random DESC limit ?

And 8% for:
 /* JobQueueDB::claimRandom */ SELECT * FROM `job` WHERE job_cmd = ? AND job_token = ? AND (job_random >= ?) ORDER BY job_random limit ?


So that sums up to 29%.  Roughly 30% were admin commands, so claimRandom took about half the time of the remaining queries :-/
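[Editor's note] The "half the time" arithmetic works out as follows (a back-of-the-envelope check using only the percentages quoted above):

```python
# claimRandom's two variants took 21% and 8% of total query time on db39;
# roughly 30% of query time was admin commands, leaving 70% for regular
# queries. claimRandom's share of that remainder is what's meant by "half".
claim_random = 21 + 8             # 29% of all query time
non_admin = 100 - 30              # 70% of query time was regular queries
share = claim_random / non_admin  # claimRandom's share of regular query time
print(round(share, 2))            # ≈ 0.41, i.e. roughly half
```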
Comment 23 Nemo 2012-12-12 23:53:01 UTC
(In reply to comment #22)
> So that sums up to 29%. Roughly 30% were admin commands, so claimRandom
> took about half the time of the remaining queries :-/

Gerrit change #38454.

From #wikimedia-operations:

23.26 < binasher> AaronSchulz: look at the explain on JobQueueDB::claimRandom
23.26 < AaronSchulz> I recall doing that before and not seeing anything interesting
23.26 < binasher> |  1 | SIMPLE      | job   | ref  | job_cmd,job_cmd_token,job_cmd_token_id | job_cmd_token | 291     | const,const | 110700 | Using where |
23.27 < binasher> 110700 = the rows read column
23.29 < apergos> o_O
23.31 < binasher> no wonder the claimRandom calls were killing commons
23.31 < binasher> AaronSchulz: can you drop the order by from that query?
23.32 < MaxSem> mmm, ORDER BY kinda gives us the randomness:)
23.32 < binasher> its not needed for some randomness since the job_random constraint is itself picked randomly
23.32 < binasher> nor is totally random claiming of jobs a requirement
Comment 24 Aaron Schulz 2012-12-13 00:04:10 UTC
I've always found the rows count from EXPLAIN to be useless, since it never accounts for LIMIT (unless something changed recently). A lot of fast queries (deeply paged history views) look slow going by that number.

Technically, ORDER BY is needed to make sure the DB is not always just getting the top/bottom row; testing around shows that this is the case. Dropping it will lead to more races/retries for claiming rows. I'll see if there is some compromise here.
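[Editor's note] The race/retry trade-off described here can be sketched as follows. This is a hedged illustration, not MediaWiki's actual implementation: the `jobs` table, `try_claim`, and `claim_random` are hypothetical stand-ins showing how a claim token with compare-and-set plus retries copes with concurrent runners converging on the same row once ORDER BY is dropped:

```python
import random
import uuid

# Stand-in job table: 100 unclaimed jobs (empty job_token means claimable).
jobs = {i: {"job_token": ""} for i in range(100)}

def try_claim(job_id, token):
    """Atomic-claim stand-in, analogous to
    UPDATE job SET job_token = ? WHERE job_id = ? AND job_token = ''."""
    if jobs[job_id]["job_token"] == "":
        jobs[job_id]["job_token"] = token
        return True
    return False  # another runner won the race; caller retries

def claim_random(max_tries=10):
    """Pick an unclaimed job without any ORDER BY; the random pick itself
    supplies the spread that ORDER BY over a random threshold used to give."""
    token = uuid.uuid4().hex
    for _ in range(max_tries):
        candidates = [i for i, j in jobs.items() if j["job_token"] == ""]
        if not candidates:
            return None  # queue drained
        job_id = random.choice(candidates)
        if try_claim(job_id, token):
            return job_id
    return None  # lost too many races; give up this round
```

Under this scheme two runners may occasionally try the same row, but the compare-and-set claim ensures only one wins and the loser simply retries.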
Comment 25 Aaron Schulz 2012-12-13 00:11:57 UTC
(In reply to comment #24)
> I've always found the rows count from EXPLAIN to be useless, since it never
> accounts for LIMIT (unless something changed recently). A lot of fast queries
> (deeply paged history views) look slow going by that number.
> 
> Technically ORDER BY is needed to make sure the DB is not always just getting
> the top/bottom row. Testing around shows that this is the case. This will
> lead
> to more races/retries for claiming rows. I'll see if there is some compromise
> here.

More work can be done to deal with that but this issue can be closed now.
Comment 26 Platonides 2012-12-13 01:30:49 UTC
It makes no sense that MySQL needs to read so many rows with an index, unless the table is really huge or the random numbers aren't really random.

In comment 23 (the expensive query) I see a key_len of 291. That suggests to me that job_cmd is not varbinary(60) on that server but varchar(255).
That change was made in 2007! (f38344aba)

Still, although less efficient, I don't see why that should make the index not be used “correctly”. I wonder if we may have slaves with different job_cmds.
Comment 27 Niklas Laxström 2012-12-13 07:51:01 UTC
Works for me now too, but did we really have to wait 10 days for binasher to raise the issue to get it fixed, while this bug was marked as highest priority (and critical for a couple of days)? Does not compute.
Comment 28 Antoine "hashar" Musso (WMF) 2012-12-13 12:48:24 UTC
The fix was to remove the ORDER BY clause, as stated by Asher (comment 23):

https://gerrit.wikimedia.org/r/#/c/38454/

It has been deployed in 1.21wmf5 and 1.21wmf6: https://gerrit.wikimedia.org/r/#/q/I6ba6cc65d0023de3b9d9df0cfcd6a403289c9848,n,z
Comment 29 Nemo 2012-12-13 12:56:09 UTC
Whether related or not, the jobqueue on en.wiki rose from about 50k to 1.08 million since this bug was reported, but is now slowly catching up a bit since the query was fixed (queues elsewhere, such as fr.wiki, were truncated).
Comment 30 Nemo 2013-03-04 19:31:24 UTC
Still has to be backported to 1.20 (for 1.20.4; it didn't make it to 1.20.3).
Comment 31 Nemo 2013-06-02 18:37:03 UTC
(In reply to comment #30)
> Still has to be backported to 1.20 (for 1.20.4; it didn't make it to 1.20.3).

I didn't manage to get answers on what would need to be done here, so I'm marking the backport request as rejected.


