Last modified: 2014-05-13 09:19:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T26782, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 24782 - API uses non-unique value for paging for some modules
API uses non-unique value for paging for some modules
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
API (Other open bugs)
1.17.x
All All
: Normal major with 1 vote (vote)
: ---
Assigned To: Nobody - You can work on this!
:
: 34029 35786 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-08-13 14:42 UTC by Daniel Kinzler
Modified: 2014-05-13 09:19 UTC (History)
14 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
recentchanges (2.86 KB, patch)
2011-06-05 20:39 UTC, Sam Reed (reedy)
Details

Description Daniel Kinzler 2010-08-13 14:42:05 UTC
When listing recentchanges, the API uses rctimestamp for paging, which is not a unique index:

<query-continue>
    <recentchanges rcstart="2010-07-14T14:21:31Z" />
</query-continue>

This is quite bad: say, there where 23 changes in 14:21:39, but the first call only retrieved 10. Using 14:21:39 for rcstart for the next page will return the *same* 10 again, not any of the 134 remaining records. This might even put the client into an infinite loop.  

Using 14:21:40 would not be better: it go to the next timestamp, thus effectively skipping the 13 items we didn not yet read for 14:21:39.

<query-continue> absultely MUST use a unique index. The only unique index for recentchanges is rcid - but is it guaranteed to increase monotonously? Maybe <query-continue> could use a compund index too, like <recentchanges rcstart="2010-07-14T14:21:31Z" rcid="667455"/> ?

The problem is also how to change this in a B/C fashion. In theory, the client should simply use whatever attriubute it finds in <query-continue><recentchanges>. But can we be sure non has the timestamp hard-coded for that? Would it be ok to break clients that rely on rcstart being a timestamp?
Comment 1 Sam Reed (reedy) 2010-08-13 15:37:40 UTC
I don't think we'd use 2 seperate attributes, the normal "api way" is to use either a globally unique thing, or a composite key "id|somefield|anotherfield" or something to that effect
Comment 2 Roan Kattouw 2010-08-13 15:45:08 UTC
(In reply to comment #0)
> <query-continue> absultely MUST use a unique index.
Yes, and the fact that it doesn't is very bad.

> The only unique index for
> recentchanges is rcid - but is it guaranteed to increase monotonously?
I'm not sure.

> Maybe
> <query-continue> could use a compund index too, like <recentchanges
> rcstart="2010-07-14T14:21:31Z" rcid="667455"/> ?
>
That would be the nicest solution, probably, but it requires adding rc_id to the indexes we use (I think there's only two so that wouldn't be too bad). It would be encoded differently though, as rcontinue="2010-07-14T14:21:31Z|667455", to be consistent with other modules paging on multiple fields.
 
> The problem is also how to change this in a B/C fashion. In theory, the client
> should simply use whatever attriubute it finds in
> <query-continue><recentchanges>. But can we be sure non has the timestamp
> hard-coded for that? Would it be ok to break clients that rely on rcstart being
> a timestamp?
Yes, this is definitely OK. Clients are supposed to simply use whatever's in query-continue and pass it back verbatim. Nevertheless, we would still announce it, but not as a breaking change.
Comment 3 Sam Reed (reedy) 2010-08-13 19:58:04 UTC
It's actually a widerspread problem.

Watchlist and blocks (for 2), do the same sort of just timestamp continue.

Category members if sorted by timestamp

Deletedrevs if !( $mode == 'all' || $mode == 'revs' )

Logevents, protected titles

ImageInfo
Comment 4 Bryan Tong Minh 2010-10-24 14:17:46 UTC
(In reply to comment #2)
> > The problem is also how to change this in a B/C fashion. In theory, the client
> > should simply use whatever attriubute it finds in
> > <query-continue><recentchanges>. But can we be sure non has the timestamp
> > hard-coded for that? Would it be ok to break clients that rely on rcstart being
> > a timestamp?
> Yes, this is definitely OK. Clients are supposed to simply use whatever's in
> query-continue and pass it back verbatim. Nevertheless, we would still announce
> it, but not as a breaking change.

It is not necessary to break b/c here. If the new format is "$timestamp|$rcid", then we can make the API simply accept "$timestamp" as well, which would then default to "$timestamp|0"
Comment 5 Sam Reed (reedy) 2011-06-05 20:39:08 UTC
Do we need to worry here that we don't have a rc_timestamp, rc_id index?

I'm presuming, that we shouldn't get that many results simultaneously for one second?

I know adding the index would be teh slow

We could use rc_timestamp, rc_title, but then what if a title gets 2 hits in a second. Bleh
Comment 6 Sam Reed (reedy) 2011-06-05 20:39:52 UTC
Created attachment 8616 [details]
recentchanges

Is this actually even right? I'm getting tired, and can't really think
Comment 7 Sam Reed (reedy) 2011-06-05 20:50:34 UTC
(In reply to comment #3)
> It's actually a widerspread problem.
> 
> Watchlist and blocks (for 2), do the same sort of just timestamp continue.
> 
> Category members if sorted by timestamp
> 
> Deletedrevs if !( $mode == 'all' || $mode == 'revs' )
> 
> Logevents, protected titles
> 
> ImageInfo

ImageInfo is fine (no idea why I said otherwise)

Deletedrevs if !( $mode == 'all' || $mode == 'revs' ) == ($mode == 'user')

Logevents, protected titles

Category members if sorted by timestamp

Watchlist and blocks
Comment 8 Sam Reed (reedy) 2011-06-05 20:59:34 UTC
Blocks, we can just use timestamp and id

Watchlist is essentially the same as recent changes..

Category members if by timestamp....?

Log events use timestamp and id

Protected titles, timestamp and title?

Deleted revs?
Comment 9 Edward Chernenko 2011-08-08 14:39:35 UTC
This is indeed very bad.

I just failed to download a ruwiki recent blocklog because of the adminbot that blocks proxies (>500 per second).

list=logevents API is unusable under these circumstances.
Comment 10 Roan Kattouw 2011-11-20 14:52:03 UTC
(In reply to comment #6)
> Created attachment 8616 [details]
> recentchanges
> 
> Is this actually even right? I'm getting tired, and can't really think
It's got the right idea, but it needs an index on (rc_timestamp, rc_id) or something else unique. Otherwise MySQL's stupidity makes the query slow.
Comment 11 db [inactive,noenotif] 2012-02-18 18:07:12 UTC
*** Bug 34029 has been marked as a duplicate of this bug. ***
Comment 12 db [inactive,noenotif] 2012-04-07 13:59:27 UTC
*** Bug 35786 has been marked as a duplicate of this bug. ***
Comment 13 Sumana Harihareswara 2012-06-06 12:47:27 UTC
Are folks still experiencing this problem?
Comment 14 Daniel Kinzler 2012-09-05 18:12:18 UTC
Fixed the patch a bit and submitted it to gerrit as Icc43b62f
Comment 15 Daniel Kinzler 2012-09-05 18:20:56 UTC
Remaining issues:

As per comment #3 and #8, there are several more API modules with similar problems that still need to be addressed.

As per comment #2, rc_id is not part of the database index currently. Need to investigate whether that is an actual issue.
Comment 16 Brad Jorsch 2013-01-23 17:14:36 UTC
Regarding the indexes, it looks like it's sort of an issue.

Normally, MySQL will insist on filesorting due to the extra field in the ORDER BY, which is certainly an issue for recentchanges because it doesn't seem like it's smart enough to fetch just the first 50 by the index plus only those extra are "tied" for 50th place in the non-unique index.[1]

But with the way InnoDB indexes work, one index (usually the primary key) is effectively appended to every other index[2] and MySQL will take advantage of this when performing queries. So for the recentchanges changes table, since we use InnoDB and the primary key is (rc_id), the rc_timestamp index is secretly (rc_timestamp,rc_id) even though it is defined as just (rc_timestamp). So a query SELECT ... FROM recentchanges WHERE rc_timestamp > '...' ORDER BY rc_timestamp, rc_id LIMIT 50 will not have to filesort under these conditions, and the EXPLAIN output reflects this.

Whether we want to *rely* on this behavior, I don't know.


 [1]: https://dev.mysql.com/doc/refman/5.0/en/limit-optimization.html
 [2]: https://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
Comment 17 Andre Klapper 2013-03-25 09:15:36 UTC
(In reply to comment #14)
> Fixed the patch a bit and submitted it to gerrit as Icc43b62f

Patch committed on March 05th, 2013, removing keyword due to remaining issues in comment 15.
Comment 18 MZMcBride 2013-04-03 02:11:05 UTC
Related: <https://gerrit.wikimedia.org/r/57067>.
Comment 19 Gerrit Notification Bot 2013-12-24 21:00:09 UTC
Change 103589 had a related patch set uploaded by Anomie:
API: Make more continuations unique

https://gerrit.wikimedia.org/r/103589
Comment 20 Gerrit Notification Bot 2014-04-11 18:29:19 UTC
Change 103589 merged by jenkins-bot:
API: Make more continuations unique

https://gerrit.wikimedia.org/r/103589
Comment 21 Brad Jorsch 2014-04-11 19:28:45 UTC
With the merge of Gerrit change #103589, this should be fixed now for all modules in core. If I've missed any, please let me know.

If this bug is present in any extensions, please open a new bug against the relevant extensions.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links