Last modified: 2014-09-23 23:31:39 UTC
Created attachment 5066 [details] simplify query for faster performance Updates for Special:WantedPage have been disabled on en.wikibooks. I tried asking on #mediawiki-tech, yesterday and was told it was probably due to being inefficient. Since updates to Special:WantedCategories are still being allowed currently, I propose simplifying Special:WantedPages since that would appear to be the reason. I have included an attached patch which I believe does simplify the query and reduces the performance impact, without reducing functionality.
The patch 1) functionally changes the query, apparently unintentionally; and 2) includes tons of completely unrelated changes that I'm not even going to look at. It might improve performance, but only by accident and by introducing bugs. This query should be okay with correct join order. On MySQL 5.x I'm seeing that with STRAIGHT_JOIN, it's *************************** 1. row *************************** id: 1 select_type: SIMPLE table: pagelinks type: index possible_keys: pl_from,pl_namespace key: pl_namespace key_len: 265 ref: NULL rows: 254989417 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: page type: eq_ref possible_keys: name_title key: name_title key_len: 261 ref: enwiki.pagelinks.pl_namespace,enwiki.pagelinks.pl_title rows: 1 Extra: Using where; Using index *************************** 3. row *************************** id: 1 select_type: SIMPLE table: page type: eq_ref possible_keys: PRIMARY,name_title key: PRIMARY key_len: 4 ref: enwiki.pagelinks.pl_from rows: 1 Extra: Using where So just a table scan of pagelinks, not a filesort. This should run in under 15 minutes even on enwiki, hopefully, and surely on enwikibooks. IIRC, that was the criterion Tim used for shutting off queries. I'm trying it out on the toolserver now. Without the STRAIGHT_JOIN, as now, MySQL 5 tries to put pg2 first in the join order, for some ungodly reason.
Oops, I forgot about the ORDER BY. That adds the filesort again. If update_specialpages.php could be rewritten to write an unsorted result to disk, close the transaction, sort the result on disk, and cache that, it wouldn't have to keep open horribly long database transactions and might be more useful. Even with the order by and without the straight join, actually, this runs in less than a minute on the toolserver for enwikibooks. Was this really taking a long time on the real servers? This query takes 16.90s on toolserver's enwikibooks_p: EXPLAIN SELECT 'Wantedpages' as type, pl_namespace AS namespace, pl_title AS title, COUNT(*) AS value FROM pagelinks LEFT JOIN page AS pg1 ON pl_namespace = pg1.page_namespace AND pl_title = pg1.page_title LEFT JOIN page AS pg2 ON pl_from = pg2.page_id WHERE pg1.page_namespace IS NULL AND pl_namespace NOT IN ( 2, 3 ) AND pg2.page_namespace != 8 GROUP BY 1,2,3 HAVING COUNT(*) > 0 ORDER BY value DESC LIMIT 5000; Maybe MySQL 4 is optimizing it poorly?
Isn't MediaWiki suppose to work with both MySQL and PostgreSQL? I don't think PostgreSQL supports STRAIGHT_JOIN, which is why I didn't use that.
MySQL conditional comments can be used to hide MySQL-specific things: http://dev.mysql.com/doc/refman/4.1/en/comments.html The abstraction layer can also be used. Either way, it's not a problem.
Fwiw: Removing the second JOIN to remove links from MW messages seems to get rid of the filesort. I know that was added (bug 5723) to prevent listing of $1, $2 in messages (which showed up as wanted). However, removing the join doesn't seem to be adding these fake links to the results. Has something else changed to keep them out of link tables, rendering this extra join useless?
Removing the extra join would serve more or less the same purpose as using STRAIGHT_JOIN. Both should remove the filesort, if what I said above is correct.
Bug 15434 related.
-patch since there is currently no patch.