Last modified: 2010-05-15 15:33:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T3474, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 1474 - /* wfSpecialStatistics */ SELECT COUNT(cur_id) AS total FROM `cur`
/* wfSpecialStatistics */ SELECT COUNT(cur_id) AS total FROM `cur`
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Special pages (Other open bugs)
1.4.x
All All
: Normal normal with 1 vote (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2005-02-05 22:00 UTC by Jamesday
Modified: 2010-05-15 15:33 UTC (History)
0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
REL1_4 patch for SpecialStatistics.php (3.04 KB, patch)
2005-02-06 06:42 UTC, Zigger
Details
HEAD patch for SpecialStatistics.php (2.27 KB, patch)
2005-02-06 06:44 UTC, Zigger
Details

Description Jamesday 2005-02-05 22:00:09 UTC
This causes a full table scan of several gigabytes of data, 
much of which isn't cached, and regularly takes more than a 
minute for de and hundreds of seconds for en. Please change 
it to use the query below, which uses an index which is 
normally very well cached because it's used for lots of 
article lookups.

select count(cur_namespace) from cur;
+----------------------+
| count(cur_namespace) |
+----------------------+
|              1292372 |
+----------------------+
1 row in set (6.29 sec)

mysql> explain select count(cur_namespace) from cur;
+-------+-------+---------------+----------------------+-----
----+------+---------+-------------+
| table | type  | possible_keys | key                  | 
key_len | ref  | rows    | Extra       |
+-------+-------+---------------+----------------------+-----
----+------+---------+-------------+
| cur   | index | NULL          | name_title_timestamp |     
270 | NULL | 1444561 | Using index |
+-------+-------+---------------+----------------------+-----
----+------+---------+-------------+
1 row in set (0.00 sec)

Next run takes 0.62 seconds. Current code causes results 
like this on bacon:

 1548197  wikiuser    igoeje:53107     enwiki       402  
Query /* wfSpecialStatistics */ SELECT COUNT(cur_id) AS 
total FROM `cur`

or this on holbach:

23994523  wikiuser   smellie:33404     dewiki       112  
Query /* wfSpecialStatistics */ SELECT COUNT(cur_id) AS 
total FROM `cur`

Because of the long run time and high disk activity these 
queries are also a regular cause of replication lag. It's 
also fairly common to see several of these running at the 
same time for wikis like en and de.

The smaller cur records for 1.5 will help these queries but 
it'll still be better to have this index used.
Comment 1 Zigger 2005-02-06 00:49:14 UTC
I have a patch for SpecialStatistics (that I had forgotten to upload) which uses
higher-level methods of the database class, removes some unnecessary code, but
used COUNT(*) based on some mysql doco claiming that as the optimised form for
MyISAM and ISAM tables and removing the need to check for NULL values.  Do you
have a timing for COUNT(*)?
Comment 2 Jamesday 2005-02-06 04:21:20 UTC
Count(*) is the optimised form for MyISAM and ISAM table types. In 
the case of those non-transactional table types, the value is stored 
and immediately available if you use count(*). That is, it's 
available in a fraction of a second.

Wikimedia uses the InnoDB table type. So should other MediaWiki 
users, when possible, because of the far better crash recovery and 
the transaction support, which isn't available in MyISAM.

explain select count(*) from cur;
+-------+-------+---------------+--------+---------+------+---------+-
------------+
| table | type  | possible_keys | key    | key_len | ref  | rows    | 
Extra       |
+-------+-------+---------------+--------+---------+------+---------+-
------------+
| cur   | index | NULL          | cur_id |       4 | NULL | 1463251 | 
Using index |
+-------+-------+---------------+--------+---------+------+---------+-
------------+
1 row in set (0.00 sec)

As you can see, count(*) will use the primary key, which is the 
clustered key in InnoDB and so forces the same full table scan.

Optimal form for InnoDB is a count using an index which is already in 
cache or is mostly in cache. That's the namespace, title and 
timestamp index,which is routinely used for many common queries.

Caution on the MySQL manual: it is often written assuming MyISAM and 
often does not say what is different for innoDB. For InnoDB:

'InnoDB does not keep an internal count of rows in a table. (This 
would actually be somewhat complicated because of multi-versioning.) 
To process a SELECT COUNT(*) FROM T statement, InnoDB must scan an 
index of the table, which will take some time if the table is not 
entirely in the buffer pool. To get a fast count, you have to use a 
counter table you create yourself and let your application update it 
according to the inserts and deletes it does. If your table does not 
change often, using the MySQL query cache is a good solution. SHOW 
TABLE STATUS also can be used if an approximate row count is 
sufficient. See Section 15.12, “InnoDB Performance Tuning Tips”.'

The "the table is not entirely in the buffer pool" part is in error: 
it's the index which must be in the buffer pool, not the table, when 
scanning an index. I've submitted the correction.

There's already a table for statistics and any statistics reporting 
code should ideally be just asking an object for the count, with the 
object using and occasionally updating the statistics table. Perhaps 
once every hour or every ten minutes. But for now, just changing the 
variable will be a major improvment.

See http://dev.mysql.com/doc/mysql/en/innodb-restrictions.html
Comment 3 Zigger 2005-02-06 06:42:23 UTC
Created attachment 268 [details]
REL1_4 patch for SpecialStatistics.php

Optimises string quotes, uses higher-level database class methods, removes
misplaced call to getSkin(), optimises implicit db index usage by doing count
on columns that are not the primary index where possible.  Simple test was done
locally.
Comment 4 Zigger 2005-02-06 06:44:12 UTC
Created attachment 269 [details]
HEAD patch for SpecialStatistics.php

Uses higher-level database class methods, removes misplaced call to getSkin(),
optimises implicit db index usage by doing count on columns that are not the
primary index where possible.  Simple test was done locally.
Comment 5 Jamesday 2005-02-06 19:56:56 UTC
Those look good. Thanks.

One extra note on head, though it also applies to 1.4: LIKE '%
anything' alaways causes a full table scan. On the "when time is 
available to do list" should be something like different rows in 
user_rights for each capability with ur_user, ur_right as the primary 
key. Then the leading % can be dropped and it'll be a much faster 
indexed query. Shorter term goodness would be forcing sysop to be the 
first user right if present, then scanning for 'sysop%'. The request 
for lists of sysops is frequent enough that it's worth some work to 
make that particular form use an index.
Comment 6 Brion Vibber 2005-02-07 06:23:01 UTC
Changed in REL1_4 and HEAD to use COUNT(cur_name)/COUNT(page_name) 
on James's suggestion.

I'm not applying the patch, as it makes unnecessary changes to use the new 
functions without fitting the existing coding style for use of those functions.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links