Last modified: 2013-09-10 12:09:58 UTC
In trunk, the query run in Export.php in dumpFrom() (used for generating stub history files) is SELECT * FROM `page` INNER JOIN `revision` ON ((page_id=rev_page)) WHERE page_id >= 1157 AND page_id < 1158 ORDER BY page_id ASC; Revisions don't get explicitly ordered. This results in the order changing from one dump to another. Example: el.wiktionary dumps, page name υγεία, page id 1157, revid 1432 timestamp 2005-02-27T15:34:30Z either appears first in the revisions listed in the stubs-meta-history file because it has the earliest timestamp, or 4th because it's 4th if revisions are sorted by revid. Smallest revid for that page is actually 1153 with timestamp 2005-02-27T15:34:45Z. In fact the order seems to be chosen randomly depending on when the search is run: elwiktionary-20100401-stub-meta-history.xml.gz -- revid 1432 is first elwiktionary-20100505-stub-meta-history.xml.gz -- revid 1432 is 4th, 1153 is first elwiktionary-20110123-stub-meta-history.xml.gz -- revid 1432 is first Need to go through the code and make sure every such query has an explicit order for revisions. Also... need to find out why bigger revid has earlier timestamp (since in theory revids get assigned in order as used).
Note that mismatched ordering might really confuse the dump prefetching code, which IIRC assumes a consistent order of (page_id, rev_id). There are many perfectly 100% legit reasons for a later rev_id to have an earlier timestamp, chief among them: * import * restore from deletion (for things deleted prior to rev_id being stored in archive table, at least) as well as mundane things such as server clock skew.
So does anyone on this bug have any objections if I explicitly ORDER BY rev_id ASC at the end of that query? Is that going to kill performance or break some subtle thing in the dumps as they are now?
the explain doesn't look good, yuck: explain SELECT * FROM `page` INNER JOIN `revision` ON ((page_id=rev_page)) WHERE page_id >= 1157 AND page_id < 1158 ORDER BY page_id ASC, revision.rev_id ASC; +----+-------------+----------+-------+------------------------+---------+---------+---------------------+------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+----------+-------+------------------------+---------+---------+---------------------+------+----------------------------------------------+ | 1 | SIMPLE | page | range | PRIMARY | PRIMARY | 4 | NULL | 1 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | revision | ref | PRIMARY,page_timestamp | PRIMARY | 4 | elwiki.page.page_id | 9 | | +----+-------------+----------+-------+------------------------+---------+---------+---------------------+------+---------------- Prolly need to specify which key or something in here...
I assume the 'yuck' refers to temporary/filesort. Since the ORDER clause contains columns from both tables MySQL can't use an index for ordering. `revisions` has an index on rev_page,rev_id so doing both WHERE and ORDER on rev_* fields gives: explain SELECT * FROM `revision` JOIN `page` ON rev_page=page_id WHERE rev_page >= 1157 and rev_page < 1158 ORDER BY rev_page, rev_id\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: revision type: range possible_keys: PRIMARY,page_timestamp key: PRIMARY key_len: 4 ref: NULL rows: 1 Extra: Using where *************************** 2. row *************************** id: 1 select_type: SIMPLE table: page type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: enwiki.revision.rev_page rows: 1 Extra: Will the page_id range always be so small? Using a larger range causes the query optimizer to switch table join order and fall back on temporary/filesort again. We could force it with STRAIGHT_JOIN but at the risk of hitting far more rows...
It's possible to request a dump by arbitrary page range (and in fact that feature is used), so no, we can't guarantee that the range will be small, unfortunately. We really only need this for the stub dumps; maybe I can tailor the queries in that special case (yuck but less yuck than temporary/filesort).