Last modified: 2009-02-01 06:51:00 UTC
One approach to speed up uncategorized pages. Note that it is NOT optimal - ideally there would be an index for cur on cur_is_redirect. The lack of that forces loading of the full record of cur - if there was an index, we could take the value directly from the index without reading the main record at all. Better still, there would be an index on cur_namespace, cur_is_redirect and cur_title. That index could be used for many other queries, using cur_is_redirect in (0,1) when it doesn't matter whether it's a redirect or not. Field order makes a speed difference in indexes and I haven't evaluated whether cur_namespace or cur_is_redirect first is faster. See bug 250 for a periodically generated intermediate table approach which would make this cheaper without changing the cur table. This query could run against that intermediate table. We'd want one index on cur which contained every field in the intermediate table or we'd want cur updates to update the intermediate table directly (using insert delayed so they wouldn't block due to MyISAM poor locking if a query was running). With the text field removed from cur we wouldn't need this index on cur - it's just to dodge the time it takes to read all of the current full cur record. However, we don't have this yet, so: 1. create a table with just the cur_id - will load id from that index, so it's very fast. Will include records with the wrong namespace. Get them in batches of 50,000 and perform steps 2+3 for each batch. 2. join that with categorylinks to find the ones which have no category. Also very fast. 3. check the resulting list against cur to eliminate the redirects - has to read quite a large chunk of cur. This one can be done in small chunks - the sample here doesn't bother. 3b repeat 1/2/3 until all batches are in the temporary table. 4. finally the query we want This requires setup and cleanup steps which aren't present in the current framework which calls the uncategorized pages code in I don't recommend implementing this - use the intermediate table approach instead. It's a lot less messy than this hack. /* 1. First build a list of cur records in namespace 0 using the namespace index */ /* can use limit 0,50000 then 51000,100000 etc. to meet the batching */ /* requirement discussed later */ /* 607593 for en */ CREATE temporary TABLE cur_namespace0 ( cur_namespace0_id int(8) unsigned NOT NULL, UNIQUE KEY id (cur_namespace0_id) ) TYPE=MyISAM PACK_KEYS=0; insert into cur_namespace0 (cur_namespace0_id) select cur_id from cur where cur_namespace=0; /* 2. now join this with category links to get list of cur records which */ /* need to be checked against cur records for not is_redirect and to get title */ /* order by cur_id so the subsequent check of cur will read cur in */ /* order by the primary/clustered key, the fastest way */ CREATE temporary TABLE uncat_nocat ( uncat_nocat_id int(8) unsigned NOT NULL, UNIQUE KEY id (uncat_nocat_id), ) TYPE=MyISAM PACK_KEYS=0; insert into uncat_nocat (uncat_nocat_id) SELECT cur_namespace0_id FROM cur_namespace0 LEFT JOIN categorylinks ON cur_namespace0_id=cl_from WHERE cl_from IS NULL order by cur_namespace0_id; drop table cur_namespace0; /* 3. now we must read cur itself to eliminate the redirects */ /* this hurts - 438803 records to check for en. */ /* for en gets 176703 results in 4 min 17 sec on suda */ /* should really do this in batches of 50,000 or so uncat_nocat records */ /* uncat_nocat rather than cur because it's OK to scan uncat_nocat repeatedly */ /* but we gain nothing by using chunks if we use cur chunks - want one */ /* and only one scan of cur, even if we do it in pieces */ /* cur_id is used to avoid more costly check of cur_title for */ /* duplicates, which we would have to do if it was primary key. */ CREATE temporary TABLE nocat ( nocat_id int(8) unsigned NOT NULL, nocat_namespace tinyint(2) unsigned NOT NULL default '0', nocat_title varchar(255) binary NOT NULL default '', UNIQUE KEY id (nocat_id), KEY title (nocat_title) ) TYPE=MyISAM PACK_KEYS=0; insert into nocat (nocat_id, nocat_namespace, nocat_title) SELECT uncat_nocat_id, cur_namespace, cur_title from cur, uncat_nocat WHERE cur_id=uncat_nocat_id and cur_is_redirect=0; drop table uncat_nocat; /* 4. finally the query we are really after */ SELECT 'Uncategorizedpages' as type, nocat_namespace AS namespace, nocat_title AS title, nocat_title AS value FROM nocat ORDER BY nocat_title /*limit 50*/; /* cleanup */ drop table nocat;
Bumped: Is this relevant with our new schema?
While UncategorizedPages still is a crappy query, absolutely none of the above applies (cur and friends don't even exist anymore). Closing as INVALID.