Last modified: 2011-03-13 18:06:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You could still run searches in Bugzilla or access your list of votes but bug reports will obviously not be up-to-date in Bugzilla.
Bug 13622 - Create (rev_timestamp,rev_user_text) index
Create (rev_timestamp,rev_user_text) index
Status: RESOLVED WONTFIX
Product: MediaWiki
Classification: Unclassified
Database (Other open bugs)
1.13.x
All All
: Lowest enhancement (vote)
: ---
Assigned To: Nobody - You can work on this!
: patch, schema-change
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-04-06 12:21 UTC by Roan Kattouw
Modified: 2011-03-13 18:06 UTC (History)
1 user (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Patch for tables.sql (680 bytes, patch)
2008-04-07 13:46 UTC, Roan Kattouw
Details

Description Roan Kattouw 2008-04-06 12:21:39 UTC
Currently, the revision table has the usertext_timestamp (rev_user_text,rev_timestamp) index. I was kinda forced to use this for the API's ucuserprefix feature:

SELECT stuff FROM revision LEFT JOIN page ON page_id=rev_page WHERE rev_deleted = 0 AND rev_user_text LIKE '123.456.%' ORDER BY rev_user_text DESC, rev_timestamp DESC

The ORDER BY part is weird, but required to be able to se the usertext_timestamp index. It results in some counterintuitive behavior though: the API sorts contributions by username (Z to A) first, then by timestamp. Sorting the other way around (first by timestamp, then by username) would be more useful and more intuitive, but is currently impossible without killing performance.

I therefore suggest another index be created that sorts the other way, e.g. timestamp_usertext (rev_timestamp, rev_user_text).
Comment 1 Roan Kattouw 2008-04-07 13:46:08 UTC
Created attachment 4789 [details]
Patch for tables.sql
Comment 2 Brion Vibber 2008-04-07 21:59:54 UTC
Hmm, two problems.

First, this is going to be a pretty inefficient query, especially on a site like English Wikipedia where there are many thousands of contributions each day. Since only a tiny handful of them are going to match your user prefix, you have to sort through *every edit to the entire wiki* until you've matched X number of rows on the user portion of the index.

Second, the user portion of this index would only come into play *if there are multiple edits made in the same second from the same username prefix*.

If making this query, you may as well simply use the existing rev_timestamp index -- it'll be more efficient since it's a smaller index, and will give the same results in nearly all cases. But it'll still be very inefficient, because you have to sift through tens or hundreds of thousands of rows to get a handful of matches.
Comment 3 Roan Kattouw 2008-04-08 09:31:32 UTC
(In reply to comment #2)
> Hmm, two problems.
> 
> First, this is going to be a pretty inefficient query, especially on a site
> like English Wikipedia where there are many thousands of contributions each
> day. Since only a tiny handful of them are going to match your user prefix, you
> have to sort through *every edit to the entire wiki* until you've matched X
> number of rows on the user portion of the index.
Isn't that also true for the [[Special:Contributions]] query?

> Second, the user portion of this index would only come into play *if there are
> multiple edits made in the same second from the same username prefix*.
>
> If making this query, you may as well simply use the existing rev_timestamp
> index -- it'll be more efficient since it's a smaller index, and will give the
> same results in nearly all cases. But it'll still be very inefficient, because
> you have to sift through tens or hundreds of thousands of rows to get a handful
> of matches.
That's true, but I thought rev_user_text had to be indexed somewhere too as I'm using it in a WHERE clause (not a DB optimization expert). 

Comment 4 Brion Vibber 2008-04-08 23:40:00 UTC
(In reply to comment #3)
> Isn't that also true for the [[Special:Contributions]] query?

Special:Contributions does only exact username matches, which is efficient -- it hits the first part of the username_timestamp index (the username), then goes on to the timestamp part (the date sorting).

> > If making this query, you may as well simply use the existing rev_timestamp
> > index -- it'll be more efficient since it's a smaller index, and will give the
> > same results in nearly all cases. But it'll still be very inefficient, because
> > you have to sift through tens or hundreds of thousands of rows to get a handful
> > of matches.
> That's true, but I thought rev_user_text had to be indexed somewhere too as I'm
> using it in a WHERE clause (not a DB optimization expert). 

Indexing isn't required; sometimes it helps you, sometimes it doesn't. Where it helps is when you can avoid having to look at a large portion of your data set -- the index lets you skip a lot of rows that won't match.

A timestamp_username index would only help this query on the timestamp part. Since the timestamps differ between almost every edit, you will almost never have the chance to compare the second, user part of that index.
Comment 5 Roan Kattouw 2008-04-09 12:50:35 UTC
(In reply to comment #4)
> Indexing isn't required; sometimes it helps you, sometimes it doesn't. Where it
> helps is when you can avoid having to look at a large portion of your data set
> -- the index lets you skip a lot of rows that won't match.
> 
> A timestamp_username index would only help this query on the timestamp part.
> Since the timestamps differ between almost every edit, you will almost never
> have the chance to compare the second, user part of that index.
> 

OK, so let's generalize: is there any way to optimize or change the current query (pasted below) so it doesn't sort in a counter-intuitive way while not killing server kittens?

SELECT stuff FROM revision WHERE rev_deleted=0 AND rev_user_text LIKE 'Abc%' ORDER BY rev_user_text DESC, rev_timestamp DESC

I'd like to sort this just by rev_timestamp, but that causes a filesort, as does sorting the other way (rev_timestamp before rev_user_text). That's why I thought having a reverse index would help as that would ostensibly allow ORDER BY rev_timestamp DESC, rev_user_text DESC
Comment 6 Brion Vibber 2008-04-09 22:34:13 UTC
I don't believe there's an efficient way to do this, no.

The index you request would theoretically help with a sort BY rev_timestamp DESC, rev_user_text DESC, *however*:

1) It would *not* help with the query -- it would fail to cut down the data set to a manageable size for your WHERE clause, because it cannot optimize the LIKE match

2) The difference in sorting results would rarely, if ever, be any different from sorting BY rev_timestamp DESC alone


Now, if your result set ends up being *small*, then you can just use the user_timestamp index for the matching and do the sorting afterwards relatively efficiently. But if it's large (say, your username prefix matches thousands or tens of thousands or millions of edits), then that's not going to work, so you've got no general solution.

I think to generalize, you'd have to add 254 new indexes: one for each possible length of prefix. Since these would be.... well, kind of *huge* :) it wouldn't be generally feasible.
Comment 7 Roan Kattouw 2008-04-10 09:48:19 UTC
Hmm, that sucks. I think I'm gonna disable ucuserprefix and support CIDR ranges then (because that's what they're used for)

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links