Last modified: 2014-02-12 23:38:33 UTC

Bug 18078 - "Last 10 editors" function fails on PostgreSQL
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: AbuseFilter
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: postgres
Reported: 2009-03-21 01:56 UTC by Brad Jorsch
Modified: 2014-02-12 23:38 UTC (History)
Description Brad Jorsch 2009-03-21 01:56:17 UTC
PostgreSQL complains about the following query:

  SELECT DISTINCT rev_user_text FROM revision WHERE rev_page = 42 ORDER BY rev_timestamp DESC LIMIT 10

with the following error:

  ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list

This makes sense: Consider if user A edits the page first, then users B to Z edit, then A edits again. Should the above query sort A at the beginning or the end of the list? It seems MySQL only gets the expected result here by chance, BTW: see http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html and http://archives.postgresql.org/pgsql-sql/2007-02/msg00169.php for more info. When I was testing this on a simplified table, MySQL was giving the "wrong" answer until I added a "rev_page" column and the page_timestamp index.

Unfortunately, I can't think of an alternate query that won't make MySQL filesort.
Comment 1 Greg Sabino Mullane 2009-03-23 14:22:19 UTC
See $db->implicitGroupby() as a way to write two queries for cases like this.
Comment 2 Brad Jorsch 2009-03-23 18:13:20 UTC
implicitGroupby would only fix the problem here accidentally. The real problem is:

1. The behavior of "SELECT DISTINCT foo FROM baz ORDER BY bar" is not well defined (except in the case that (foo,bar) could be a unique key for the result set). MySQL ignores the undefinedness and arbitrarily chooses the value of bar from the first row fetched, while PostgreSQL throws an error. The SQL standard seems to call for PostgreSQL's behavior, BTW.
2. MySQL filesorts for any well-defined variation of the query that I can think of, for example "SELECT foo, MAX(bar) as max_bar FROM baz GROUP BY foo ORDER BY max_bar". Unless the rules are different for AbuseFilter than for API queries, a query that filesorts will bring the wrath of domas upon us.
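A minimal sketch of the well-defined variation from point 2, applied to the revision table. It uses Python with an in-memory SQLite database purely as a stand-in so that it can be run anywhere; the table layout is reduced to the three columns named in this report, and the helper names `last_editors` and `make_demo_db` are made up for illustration. It says nothing about whether MySQL filesorts.

```python
import sqlite3


def last_editors(conn, page_id, limit=10):
    """Return distinct editors of a page, ordered by their most recent edit."""
    rows = conn.execute(
        """
        SELECT rev_user_text, MAX(rev_timestamp) AS max_ts
        FROM revision
        WHERE rev_page = ?
        GROUP BY rev_user_text
        ORDER BY max_ts DESC
        LIMIT ?
        """,
        (page_id, limit),
    ).fetchall()
    return [user for user, _max_ts in rows]


def make_demo_db():
    """Build a tiny revision table: A edits first, then B and C, then A again."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        "rev_page INTEGER, rev_user_text TEXT, rev_timestamp TEXT)"
    )
    edits = [
        ("A", "20090101000000"),
        ("B", "20090102000000"),
        ("C", "20090103000000"),
        ("A", "20090104000000"),
    ]
    conn.executemany("INSERT INTO revision VALUES (42, ?, ?)", edits)
    return conn


conn = make_demo_db()
print(last_editors(conn, 42))  # ['A', 'C', 'B']
```

Because the sort key is an aggregate that appears in the select list, the position of user A is unambiguous: it is decided by A's latest edit.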

As you implied, one possible "fix" for the problem is to continue using the accidentally-working query for MySQL and the correct query for PostgreSQL. implicitGroupby() could act as an "is this MySQL?" flag when we consider only MySQL versus PostgreSQL, but it makes as much sense as using cascadingDeletes(), cleanupTriggers(), strictIPs(), implicitOrderby(), realTimestamps(), searchableIPs(), or functionalIndexes() for the same purpose, since the problem has nothing to do with whether the database sorts the result rows to implement GROUP BY. Better IMO would be to explicitly check $wgDBtype == 'mysql', since then it's clearly marked as being a MySQL-specific hack.


BTW, as far as I can tell the reason the query works in MySQL is that it uses the page_timestamp index to fetch the rows, and the "ORDER BY rev_timestamp DESC LIMIT 10" somehow causes it to use the index in reverse order. This makes its arbitrary choice of which rev_timestamp to keep be the maximum timestamp, so the later application of the ORDER BY does what we wanted. If anything changes to make the arbitrary choice not be the maximum rev_timestamp, it will start giving incorrect results. In fact, I found in my testing that simply leaving out the "LIMIT 10" seems to make MySQL use the page_timestamp index in forward order, so the ORDER BY sorts by each user's earliest edit rather than their most recent.
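The effect described above can be imitated with a toy model in Python. This is only a simulation of the "keep the first row fetched per user, then sort" behavior as described in this comment, not MySQL's actual executor; the function names and the integer timestamps are made up for illustration.

```python
def distinct_keep_first(rows):
    """Keep the timestamp of the first row fetched for each user."""
    kept = {}
    for user, timestamp in rows:
        kept.setdefault(user, timestamp)
    return kept


def simulate_last_editors(index_rows, reverse_scan, limit=None):
    """Scan the index in the given direction, deduplicate, then ORDER BY DESC."""
    scan = list(reversed(index_rows)) if reverse_scan else list(index_rows)
    kept = distinct_keep_first(scan)
    ordered = sorted(kept, key=kept.get, reverse=True)
    return ordered[:limit]


# Rows for one page in ascending timestamp order, as an index on
# (rev_page, rev_timestamp) would yield them: A edits first and last.
index_rows = [("A", 1), ("B", 2), ("C", 3), ("A", 4)]

# Reverse scan: each user's latest timestamp is kept.
print(simulate_last_editors(index_rows, reverse_scan=True))   # ['A', 'C', 'B']
# Forward scan: each user's earliest timestamp is kept.
print(simulate_last_editors(index_rows, reverse_scan=False))  # ['C', 'B', 'A']
```

In the forward-scan case A drops to the end of the list even though A made the most recent edit, which is the incorrect result the comment warns about.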
Comment 3 Greg Sabino Mullane 2009-03-23 18:32:31 UTC
Interesting. Well, if this is another MySQLism, the proper thing to do here (other than finding a way to write this in a standard way that makes MySQL happy) is to create another implicitGroupby()-like attribute and set it to true for MySQL and false for the rest. This avoids hardcoding the '== mysql' bit and makes future changes easier in the off chance that some other database has the same behavior as MySQL. It would also be great to document this heavily in the code in question, or at least add a pointer to this thread.
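A rough sketch of the capability-attribute idea, written in Python rather than MediaWiki's PHP so that it is self-contained. Every name here (`Database`, `MySQLDatabase`, `PostgresDatabase`, `tolerates_distinct_order_by_unselected`, `last_editors_sql`) is hypothetical and is not part of MediaWiki's database layer; the two queries are the ones quoted earlier in this report.

```python
class Database:
    """Stand-in for a database abstraction layer (hypothetical)."""

    def tolerates_distinct_order_by_unselected(self):
        # Default: ORDER BY columns must appear in a SELECT DISTINCT list.
        return False


class MySQLDatabase(Database):
    def tolerates_distinct_order_by_unselected(self):
        # MySQL accepts the query and picks a sort value arbitrarily.
        return True


class PostgresDatabase(Database):
    pass


# Query that only happens to work on MySQL (see bug 18078).
LENIENT_SQL = (
    "SELECT DISTINCT rev_user_text FROM revision "
    "WHERE rev_page = 42 ORDER BY rev_timestamp DESC LIMIT 10"
)

# Well-defined variant; reported to filesort on MySQL.
STANDARD_SQL = (
    "SELECT rev_user_text, MAX(rev_timestamp) AS max_ts FROM revision "
    "WHERE rev_page = 42 GROUP BY rev_user_text "
    "ORDER BY max_ts DESC LIMIT 10"
)


def last_editors_sql(db):
    """Pick the query from a capability flag, not from the database type."""
    if db.tolerates_distinct_order_by_unselected():
        return LENIENT_SQL
    return STANDARD_SQL


print(last_editors_sql(PostgresDatabase()))
```

The alternative from comment 2, an explicit database-type check, would replace the method call with a comparison against the configured type; the trade-off between the two is the one discussed in these comments.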
Comment 4 Andrew Garrett 2009-03-31 14:57:49 UTC
The other thing that could be done is to just drop the DISTINCT.
Comment 5 Brad Jorsch 2009-03-31 16:55:44 UTC
(In reply to comment #4)
> The other thing that could be done is to just drop the DISTINCT.

That would change how the "Last 10 editors" function works, though. If User:Example is one of those people who doesn't use the preview button, you could easily end up with that returning "Example, Example, Example, Example, Example, Example, Example, Example, Example, Example" rather than Example and 9 other editors.
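A small illustration of that scenario, using Python with an in-memory SQLite database as a stand-in. The data is invented: nine other users edit page 42, then Example saves ten times in a row. The helper name `last_ten_without_distinct` is made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    "rev_page INTEGER, rev_user_text TEXT, rev_timestamp INTEGER)"
)

# Nine other editors first, then ten consecutive saves by Example.
edits = [(42, f"User{i}", i) for i in range(1, 10)]
edits += [(42, "Example", t) for t in range(10, 20)]
conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", edits)


def last_ten_without_distinct(conn, page_id):
    """The query from the description with DISTINCT dropped."""
    rows = conn.execute(
        "SELECT rev_user_text FROM revision "
        "WHERE rev_page = ? ORDER BY rev_timestamp DESC LIMIT 10",
        (page_id,),
    )
    return [row[0] for row in rows]


result = last_ten_without_distinct(conn, 42)
print(result)  # 'Example' ten times; none of the other nine editors appear
```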
Comment 6 Andrew Garrett 2009-07-16 17:07:56 UTC
(batch change)

Minor bugs that nevertheless need looking into
