Last modified: 2013-09-10 12:09:58 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T29112, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 27112 - select of revisions for stub history files does not explicitly order revisions
select of revisions for stub history files does not explicitly order revisions
Status: NEW
Product: Datasets
Classification: Unclassified
General/Unknown (Other open bugs)
unspecified
All All
: Normal minor (vote)
: ---
Assigned To: Ariel T. Glenn
: analytics
Depends on:
Blocks: 27110
  Show dependency treegraph
 
Reported: 2011-02-02 19:24 UTC by Ariel T. Glenn
Modified: 2013-09-10 12:09 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Ariel T. Glenn 2011-02-02 19:24:37 UTC
In trunk, the query run in Export.php in dumpFrom()  (used for generating stub history files) is

SELECT  *  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page))  WHERE page_id >= 1157 AND page_id < 1158  ORDER BY page_id ASC;

Revisions don't get explicitly ordered.  This results in the order changing from one dump to another. 

Example:

el.wiktionary dumps, page name υγεία, page id 1157, revid 1432 timestamp 2005-02-27T15:34:30Z  either appears first in the revisions listed in the stubs-meta-history file because it has the earliest timestamp, or 4th because it's 4th if revisions are sorted by revid. 

Smallest revid for that page is actually 1153 with timestamp 2005-02-27T15:34:45Z.

In fact the order seems to be chosen randomly depending on when the search is run:

elwiktionary-20100401-stub-meta-history.xml.gz  -- revid 1432 is first
elwiktionary-20100505-stub-meta-history.xml.gz  -- revid 1432 is 4th, 1153 is first
elwiktionary-20110123-stub-meta-history.xml.gz -- revid 1432 is first

Need to go through the code and make sure every such query has an explicit
order for revisions.

Also... need to find out why bigger revid has earlier timestamp (since in theory revids get assigned in order as used).
Comment 1 Brion Vibber 2011-02-02 19:28:14 UTC
Note that mismatched ordering might really confuse the dump prefetching code, which IIRC assumes a consistent order of (page_id, rev_id).


There are many perfectly 100% legit reasons for a later rev_id to have an earlier timestamp, chief among them:
* import
* restore from deletion (for things deleted prior to rev_id being stored in archive table, at least)

as well as mundane things such as server clock skew.
Comment 2 Ariel T. Glenn 2013-09-09 15:25:49 UTC
So does anyone on this bug have any objections if I explicitly ORDER BY rev_id ASC at the end of that query? Is that going to kill performance or break some subtle thing in the dumps as they are now?
Comment 3 Ariel T. Glenn 2013-09-09 15:33:34 UTC
the explain doesn't look good, yuck:

explain SELECT  *  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page))  WHERE page_id >= 1157 AND page_id < 1158  ORDER BY page_id ASC, revision.rev_id ASC;
+----+-------------+----------+-------+------------------------+---------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table    | type  | possible_keys          | key     | key_len | ref                 | rows | Extra                                        |
+----+-------------+----------+-------+------------------------+---------+---------+---------------------+------+----------------------------------------------+
|  1 | SIMPLE      | page     | range | PRIMARY                | PRIMARY | 4       | NULL                |    1 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | revision | ref   | PRIMARY,page_timestamp | PRIMARY | 4       | elwiki.page.page_id |    9 |                                              |
+----+-------------+----------+-------+------------------------+---------+---------+---------------------+------+----------------

Prolly need to specify which key or something in here...
Comment 4 Sean Pringle 2013-09-09 18:43:03 UTC
I assume the 'yuck' refers to temporary/filesort. Since the ORDER clause contains columns from both tables MySQL can't use an index for ordering.

`revisions` has an index on rev_page,rev_id so doing both WHERE and ORDER on rev_* fields gives:

explain SELECT  *  FROM `revision`  JOIN `page` ON rev_page=page_id WHERE rev_page >= 1157 and rev_page < 1158  ORDER BY rev_page, rev_id\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: range
possible_keys: PRIMARY,page_timestamp
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 1
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.revision.rev_page
         rows: 1
        Extra:

Will the page_id range always be so small?

Using a larger range causes the query optimizer to switch table join order and fall back on temporary/filesort again.  We could force it with STRAIGHT_JOIN but at the risk of hitting far more rows...
Comment 5 Ariel T. Glenn 2013-09-10 12:09:58 UTC
It's possible to request a dump by arbitrary page range (and in fact that feature is used), so no, we can't guarantee that the range will be small, unfortunately.

We really only need this for the stub dumps; maybe I can tailor the queries in that special case (yuck but less yuck than temporary/filesort).

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links