Last modified: 2009-02-01 06:51:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2449, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 449 - uncategorized pages report speedup hack


Summary:	uncategorized pages report speedup hack

Status:	RESOLVED INVALID

Product:	MediaWiki
Classification:	Unclassified
Component:	Special pages (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal with 1 vote (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2004-09-11 02:29 UTC by Jamesday
Modified:	2009-02-01 06:51 UTC (History)
CC List:	1 user (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Jamesday 2004-09-11 02:29:43 UTC

One approach to speed up uncategorized pages. Note 
that it is NOT optimal - ideally there would be an 
index for cur on cur_is_redirect. The lack of that 
forces loading of the full record of cur - if there 
was an index, we could take the value directly from 
the index without reading the main record at all. 
Better still, there would be an index on 
cur_namespace, cur_is_redirect and cur_title. That 
index could be used for many other queries, using 
cur_is_redirect in (0,1) when it doesn't matter 
whether it's a redirect or not. Field order makes a 
speed difference in indexes and I haven't evaluated 
whether cur_namespace or cur_is_redirect first is 
faster.

See bug 250 for a periodically generated intermediate 
table approach which would make this cheaper without 
changing the cur table. This query could run against 
that intermediate table. We'd want one index on cur 
which contained every field in the intermediate table 
or we'd want cur updates to update the intermediate 
table directly (using insert delayed so they wouldn't 
block due to MyISAM poor locking if a query was 
running). With the text field removed from cur we 
wouldn't need this index on cur - it's just to dodge 
the time it takes to read all of the current full cur 
record.

However, we don't have this yet, so:
1. create a table with just the cur_id - will load id 
from that index, so it's very fast. Will include 
records with the wrong namespace. Get them in batches 
of 50,000 and perform steps 2+3 for each batch.
2. join that with categorylinks to find the ones 
which have no category. Also very fast.
3. check the resulting list against cur to eliminate 
the redirects - has to read quite a large chunk of 
cur. This one can be done in small chunks - the 
sample here doesn't bother.
3b repeat 1/2/3 until all batches are in the 
temporary table.
4. finally the query we want

This requires setup and cleanup steps which aren't 
present in the current framework which calls the 
uncategorized pages code in 

I don't recommend implementing this - use the 
intermediate table approach instead. It's a lot less 
messy than this hack.

/* 1. First build a list of cur records in namespace 
0 using the namespace index */
/* can use limit 0,50000 then 51000,100000 etc. to 
meet the batching */
/* requirement discussed later */
/* 607593 for en */
CREATE temporary TABLE cur_namespace0 (
  cur_namespace0_id int(8) unsigned NOT NULL,
  UNIQUE KEY id (cur_namespace0_id)
) TYPE=MyISAM PACK_KEYS=0;
insert into cur_namespace0 (cur_namespace0_id)
select cur_id from cur where cur_namespace=0;

/* 2. now join this with category links to get list 
of cur records which */
/* need to be checked against cur records for not 
is_redirect and to get title */
/* order by cur_id so the subsequent check of cur 
will read cur in */
/* order by the primary/clustered key, the fastest 
way */
CREATE temporary TABLE uncat_nocat (
  uncat_nocat_id int(8) unsigned NOT NULL,
  UNIQUE KEY id (uncat_nocat_id),
) TYPE=MyISAM PACK_KEYS=0;
insert into uncat_nocat (uncat_nocat_id)
SELECT cur_namespace0_id
FROM cur_namespace0 LEFT JOIN categorylinks ON 
cur_namespace0_id=cl_from
WHERE cl_from IS NULL
order by cur_namespace0_id;
drop table cur_namespace0;

/* 3. now we must read cur itself to eliminate the 
redirects */
/* this hurts - 438803 records to check for en. */
/* for en gets 176703 results in 4 min 17 sec on suda 
*/
/* should really do this in batches of 50,000 or so 
uncat_nocat records */
/* uncat_nocat rather than cur because it's OK to 
scan uncat_nocat repeatedly */
/* but we gain nothing by using chunks if we use cur 
chunks - want one */
/* and only one scan of cur, even if we do it in 
pieces */
/* cur_id is used to avoid more costly check of 
cur_title for */
/* duplicates, which we would have to do if it was 
primary key. */
CREATE temporary TABLE nocat (
  nocat_id int(8) unsigned NOT NULL,
  nocat_namespace tinyint(2) unsigned NOT NULL 
default '0',
  nocat_title varchar(255) binary NOT NULL default '',
  UNIQUE KEY id (nocat_id),
  KEY title (nocat_title)
) TYPE=MyISAM PACK_KEYS=0;
insert into nocat (nocat_id, nocat_namespace, 
nocat_title)
SELECT uncat_nocat_id, cur_namespace, cur_title from 
cur, uncat_nocat
WHERE cur_id=uncat_nocat_id and cur_is_redirect=0;
drop table uncat_nocat;

/* 4. finally the query we are really after */
SELECT 'Uncategorizedpages' as type, nocat_namespace 
AS namespace, nocat_title AS title, nocat_title AS 
value
FROM nocat
ORDER BY nocat_title /*limit 50*/;

/* cleanup */
drop table nocat;

Comment 1 Rob Church 2006-01-02 14:35:05 UTC

Bumped: Is this relevant with our new schema?

Comment 2 Chad H. 2009-02-01 06:51:00 UTC

While UncategorizedPages still is a crappy query, absolutely none of the above applies (cur and friends don't even exist anymore). Closing as INVALID.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links