Last modified: 2014-05-13 09:19:18 UTC
When listing recentchanges, the API uses rctimestamp for paging, which is not a unique index: <query-continue> <recentchanges rcstart="2010-07-14T14:21:31Z" /> </query-continue> This is quite bad: say, there where 23 changes in 14:21:39, but the first call only retrieved 10. Using 14:21:39 for rcstart for the next page will return the *same* 10 again, not any of the 134 remaining records. This might even put the client into an infinite loop. Using 14:21:40 would not be better: it go to the next timestamp, thus effectively skipping the 13 items we didn not yet read for 14:21:39. <query-continue> absultely MUST use a unique index. The only unique index for recentchanges is rcid - but is it guaranteed to increase monotonously? Maybe <query-continue> could use a compund index too, like <recentchanges rcstart="2010-07-14T14:21:31Z" rcid="667455"/> ? The problem is also how to change this in a B/C fashion. In theory, the client should simply use whatever attriubute it finds in <query-continue><recentchanges>. But can we be sure non has the timestamp hard-coded for that? Would it be ok to break clients that rely on rcstart being a timestamp?
I don't think we'd use 2 seperate attributes, the normal "api way" is to use either a globally unique thing, or a composite key "id|somefield|anotherfield" or something to that effect
(In reply to comment #0) > <query-continue> absultely MUST use a unique index. Yes, and the fact that it doesn't is very bad. > The only unique index for > recentchanges is rcid - but is it guaranteed to increase monotonously? I'm not sure. > Maybe > <query-continue> could use a compund index too, like <recentchanges > rcstart="2010-07-14T14:21:31Z" rcid="667455"/> ? > That would be the nicest solution, probably, but it requires adding rc_id to the indexes we use (I think there's only two so that wouldn't be too bad). It would be encoded differently though, as rcontinue="2010-07-14T14:21:31Z|667455", to be consistent with other modules paging on multiple fields. > The problem is also how to change this in a B/C fashion. In theory, the client > should simply use whatever attriubute it finds in > <query-continue><recentchanges>. But can we be sure non has the timestamp > hard-coded for that? Would it be ok to break clients that rely on rcstart being > a timestamp? Yes, this is definitely OK. Clients are supposed to simply use whatever's in query-continue and pass it back verbatim. Nevertheless, we would still announce it, but not as a breaking change.
It's actually a widerspread problem. Watchlist and blocks (for 2), do the same sort of just timestamp continue. Category members if sorted by timestamp Deletedrevs if !( $mode == 'all' || $mode == 'revs' ) Logevents, protected titles ImageInfo
(In reply to comment #2) > > The problem is also how to change this in a B/C fashion. In theory, the client > > should simply use whatever attriubute it finds in > > <query-continue><recentchanges>. But can we be sure non has the timestamp > > hard-coded for that? Would it be ok to break clients that rely on rcstart being > > a timestamp? > Yes, this is definitely OK. Clients are supposed to simply use whatever's in > query-continue and pass it back verbatim. Nevertheless, we would still announce > it, but not as a breaking change. It is not necessary to break b/c here. If the new format is "$timestamp|$rcid", then we can make the API simply accept "$timestamp" as well, which would then default to "$timestamp|0"
Do we need to worry here that we don't have a rc_timestamp, rc_id index? I'm presuming, that we shouldn't get that many results simultaneously for one second? I know adding the index would be teh slow We could use rc_timestamp, rc_title, but then what if a title gets 2 hits in a second. Bleh
Created attachment 8616 [details] recentchanges Is this actually even right? I'm getting tired, and can't really think
(In reply to comment #3) > It's actually a widerspread problem. > > Watchlist and blocks (for 2), do the same sort of just timestamp continue. > > Category members if sorted by timestamp > > Deletedrevs if !( $mode == 'all' || $mode == 'revs' ) > > Logevents, protected titles > > ImageInfo ImageInfo is fine (no idea why I said otherwise) Deletedrevs if !( $mode == 'all' || $mode == 'revs' ) == ($mode == 'user') Logevents, protected titles Category members if sorted by timestamp Watchlist and blocks
Blocks, we can just use timestamp and id Watchlist is essentially the same as recent changes.. Category members if by timestamp....? Log events use timestamp and id Protected titles, timestamp and title? Deleted revs?
This is indeed very bad. I just failed to download a ruwiki recent blocklog because of the adminbot that blocks proxies (>500 per second). list=logevents API is unusable under these circumstances.
(In reply to comment #6) > Created attachment 8616 [details] > recentchanges > > Is this actually even right? I'm getting tired, and can't really think It's got the right idea, but it needs an index on (rc_timestamp, rc_id) or something else unique. Otherwise MySQL's stupidity makes the query slow.
*** Bug 34029 has been marked as a duplicate of this bug. ***
*** Bug 35786 has been marked as a duplicate of this bug. ***
Are folks still experiencing this problem?
Fixed the patch a bit and submitted it to gerrit as Icc43b62f
Remaining issues: As per comment #3 and #8, there are several more API modules with similar problems that still need to be addressed. As per comment #2, rc_id is not part of the database index currently. Need to investigate whether that is an actual issue.
Regarding the indexes, it looks like it's sort of an issue. Normally, MySQL will insist on filesorting due to the extra field in the ORDER BY, which is certainly an issue for recentchanges because it doesn't seem like it's smart enough to fetch just the first 50 by the index plus only those extra are "tied" for 50th place in the non-unique index.[1] But with the way InnoDB indexes work, one index (usually the primary key) is effectively appended to every other index[2] and MySQL will take advantage of this when performing queries. So for the recentchanges changes table, since we use InnoDB and the primary key is (rc_id), the rc_timestamp index is secretly (rc_timestamp,rc_id) even though it is defined as just (rc_timestamp). So a query SELECT ... FROM recentchanges WHERE rc_timestamp > '...' ORDER BY rc_timestamp, rc_id LIMIT 50 will not have to filesort under these conditions, and the EXPLAIN output reflects this. Whether we want to *rely* on this behavior, I don't know. [1]: https://dev.mysql.com/doc/refman/5.0/en/limit-optimization.html [2]: https://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
(In reply to comment #14) > Fixed the patch a bit and submitted it to gerrit as Icc43b62f Patch committed on March 05th, 2013, removing keyword due to remaining issues in comment 15.
Related: <https://gerrit.wikimedia.org/r/57067>.
Change 103589 had a related patch set uploaded by Anomie: API: Make more continuations unique https://gerrit.wikimedia.org/r/103589
Change 103589 merged by jenkins-bot: API: Make more continuations unique https://gerrit.wikimedia.org/r/103589
With the merge of Gerrit change #103589, this should be fixed now for all modules in core. If I've missed any, please let me know. If this bug is present in any extensions, please open a new bug against the relevant extensions.