Last modified: 2011-03-13 18:04:32 UTC
When you edit a template, the job queue gets filled. If you edit the same template multiple times within a short period, the job queue ends up with identical entries. This does not cause an error, but it does create unnecessary database load. I propose that the job table be checked for existing entries with the same job_cmd, job_namespace and job_title before a new one is inserted; a sketch of such a check follows below. Since the job queue is updated after the edit is finished, entries that exist at check time but are completed before the table is updated should not be a problem. The benefits would be:

* less DB load
* faster updates on long queues, because they are considerably shorter

While writing this, the English Wikipedia has a job queue of 82,305 entries, so I am making this a high-priority bug.
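A minimal sketch of the check I have in mind, in plain SQL; the command name, namespace number and title below are only example values, not taken from any actual code:

  -- Assumed: look for an identical pending job before inserting a new one.
  SELECT COUNT(*)
  FROM job
  WHERE job_cmd = 'refreshLinks'
    AND job_namespace = 10
    AND job_title = 'Some_template';
  -- Insert the new row only if the count above is zero.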
I am sorry, I just found that the cleanup of jobs seems to check this already: it deletes all similar jobs in a single step, so the DB load is not significantly higher than if the duplicates had never been inserted in the first place. The only drawback is that you do not see the actual length of the job queue. One could use a grouped count in SpecialStatistics.php.
One would need to add the following in SpecialStatistics.php (lines prefixed with "=" are existing code, lines prefixed with "+" are additions):

  =  $numJobs = $dbr->selectField( 'job', 'COUNT(*)', '', $fname );
  +  $numJobsGrouped = $dbr->selectField( 'job', 'COUNT(DISTINCT `job_title`,`job_cmd`,`job_namespace`)', '', $fname );

  =  $wgLang->formatNum( $images ),
  +  $wgLang->formatNum( $numJobsGrouped )

and add some text to the message Sitestatstext. I am not sure about the database load of COUNT(DISTINCT ...) on large systems, so it might not be a good idea. Another possible SELECT would be:

  SELECT COUNT(*) AS C FROM `job`
  WHERE `job_id` IN (SELECT `job_id` FROM `job` GROUP BY `job_cmd`, `job_namespace`, `job_title`);
We could just add a unique index on those three columns and use an INSERT IGNORE when stuffing rows into the job queue, but I'd like another opinion on whether or not the duplicates are, in fact, causing load that we need to be worried about.
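A rough sketch of what that could look like; the index name and the example values are mine, not from any actual patch:

  -- Assumed: a unique key covering the three columns discussed above.
  ALTER TABLE job
    ADD UNIQUE INDEX job_cmd_namespace_title (job_cmd, job_namespace, job_title);

  -- With the key in place, enqueueing a duplicate is silently dropped.
  INSERT IGNORE INTO job (job_cmd, job_namespace, job_title, job_params)
  VALUES ('refreshLinks', 10, 'Some_template', '');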
The duplicates are used because the original checking on add was very expensive (the inserts must be very fast, while the processing can take as long as it needs). An INSERT IGNORE might not do too bad, though, dunno.
I didn't use a unique index in the original code because I imagined that at some stage in the future, we may want to add job types that require execution of duplicates. For example, a job type with no attached title, defined entirely by the last few bytes of a large job_params blob, would create duplicates in a (job_cmd,job_namespace,job_title) key. The current method is good enough for now, although I would like to switch to a specialised non-MySQL data structure at some stage. -- Tim Starling
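To illustrate the kind of job described above (the command name and parameter values here are hypothetical, not an existing job type): both rows below are genuinely different jobs, distinguished only by job_params, yet they would collide on a (job_cmd, job_namespace, job_title) unique key.

  INSERT INTO job (job_cmd, job_namespace, job_title, job_params)
  VALUES ('batchRebuild', 0, '', 'range=0-999');

  INSERT INTO job (job_cmd, job_namespace, job_title, job_params)
  VALUES ('batchRebuild', 0, '', 'range=1000-1999');

  -- With INSERT IGNORE and the proposed unique key, the second job
  -- would be silently discarded.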