Last modified: 2014-05-05 22:43:28 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T47266, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 45266 - Incremental indexing (OAI) randomly skips events
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2 (Other open bugs)
Version: unspecified
Hardware/OS: All / All
Importance: Normal priority, major severity
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-02-22 02:18 UTC by Tim Starling
Modified: 2014-05-05 22:43 UTC
CC: 8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Tim Starling 2013-02-22 02:18:38 UTC
Analysis of the lucene-search-2 code indicates a possible explanation for reports of page updates occasionally being missed, causing the previous version of a page to persist in the index indefinitely.

The main loop of IncrementalUpdater fetches OAI records from MediaWiki, 50 pages at a time. It uses the "from" timestamp parameter to advance through the update list: after each batch of pages, it takes the date from the <responseDate> element as the next value of the "from" parameter. This has the following flaws (see the sketch after the list):

* responseDate is the time at which the response is generated. If there is replication lag, the most recent timestamp visible on the chosen slave might be some seconds in the past. Thus, events spanning an interval equal to the replication lag will be skipped.

* responseDate and the "from" parameter have one-second resolution. The English Wikipedia sees about 5 edits per second at peak, so events can appear in the database bearing a timestamp that IncrementalUpdater has already processed, because they were committed later within the same second.

* Using the revision timestamp instead of responseDate would be an improvement. However, rev_timestamp and up_timestamp are generated before the transaction is committed, and it is unknown how long the transaction will take to complete, so the order of rev_timestamp or up_timestamp in the replication log will typically not be monotonic. Additionally, the approach would be highly sensitive to Apache clock skew.
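
To make the failure mode concrete, here is a minimal sketch of the timestamp-driven loop described above. It is not the actual lucene-search-2 code; OaiClient, OaiBatch, and TimestampUpdater are hypothetical names standing in for the real classes.

// A minimal sketch, not the actual lucene-search-2 code.
interface OaiBatch {
    java.util.List<String> records();  // page records in this batch
    String responseDate();             // contents of the <responseDate> element
}

interface OaiClient {
    OaiBatch listRecords(String from); // ListRecords with the "from" parameter
}

class TimestampUpdater {
    void run(OaiClient client, String start) {
        String from = start;           // e.g. "2013-02-22T02:18:38Z"
        while (true) {
            OaiBatch batch = client.listRecords(from);
            // ... update the index from batch.records() ...
            // Flaw: with replication lag, edits committed just before
            // responseDate may not yet be visible on the slave, but the next
            // request starts from responseDate, so they are never fetched.
            // Flaw: one-second resolution means edits committed later within
            // the same second as responseDate are likewise skipped.
            from = batch.responseDate();
        }
    }
}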

The obvious solution is to advance through the update list using the sequence number (resumptionToken) instead of the timestamp, as sketched below.
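
For comparison, here is a sketch of the token-driven loop, redefining the same hypothetical interfaces around resumptionToken; again, this is an illustration, not the merged patch.

// Hypothetical token-driven variant; not the merged patch.
interface TokenBatch {
    java.util.List<String> records();
    String resumptionToken();            // opaque position in the update sequence
}

interface TokenClient {
    TokenBatch listRecords(String from); // initial request only
    TokenBatch resume(String token);     // every subsequent request
}

class TokenUpdater {
    void run(TokenClient client, String start) {
        TokenBatch batch = client.listRecords(start);  // timestamp used once
        while (true) {
            // ... update the index from batch.records() ...
            // The token pins an exact position in the update sequence, so
            // neither replication lag nor same-second commits can drop events.
            batch = client.resume(batch.resumptionToken());
        }
    }
}

Because the token names a position rather than a point in time, the server can return exactly the records that follow it, independent of clock resolution and slave lag.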
Comment 1 Andre Klapper 2013-02-25 19:47:54 UTC
Setting the assignee to Ram and changing the status to ASSIGNED, as he is working on this.
Comment 3 Munagala Ramanath (Ram) 2013-03-12 00:52:24 UTC
Revised versions of the patch, which should be compatible with existing clients and servers:
https://gerrit.wikimedia.org/r/#q,I1c4d2d208146e61ccd15975bb412a423849988c8,n,z
https://gerrit.wikimedia.org/r/#q,Ia8d74f82ecf7d4c5c1b612a39fbcef99bcd10334,n,z

Still undergoing testing.
Comment 4 Andre Klapper 2013-03-25 14:10:02 UTC
Ram: Both of the patches mentioned above have been merged. Is anything more needed here (if yes: what?), or can this bug report be closed as RESOLVED FIXED?
Comment 5 Munagala Ramanath (Ram) 2013-03-25 14:50:13 UTC
Andre: The merged Lucene code has not yet been deployed; we should wait until that happens and no further issues are reported before closing this.
Comment 6 Andre Klapper 2013-05-14 12:21:38 UTC
(In reply to comment #5)
> Andre: The merged Lucene code has not yet been deployed; we should wait until
> that happens and no further issues are reported before closing this.

Does anybody know where I can track this deployment, or when it is scheduled?

Tim?
Comment 7 Andre Klapper 2013-07-03 12:16:39 UTC
(In reply to comment #5)
> Andre: The merged Lucene code has not yet been deployed; we should wait until
> that happens and no further issues are reported before closing this.

Does anybody know where I can track this deployment, or when it is scheduled?

Tim?
Comment 8 Greg Grossmeier 2013-07-03 16:36:28 UTC
Adding Chad and Nik to the cc for their insight.
Comment 9 Nemo 2013-09-30 06:20:06 UTC
Adjusting severity: this is not "normal" but at least major (or even "critical", since in practice we don't rebuild the index, so search data may be "lost" for years). However, if more work is needed, I doubt it will happen now that the focus is on CirrusSearch.
Comment 10 Nik Everett 2013-09-30 12:35:00 UTC
Do we plan to deploy this? I only did a quick review of it, but at that level it looks sane.
Comment 11 Chad H. 2014-02-13 06:20:53 UTC
The OAI code went out long ago. I honestly can't remember whether we ever pushed out a fixed lsearchd.
Comment 12 Chad H. 2014-02-20 22:25:19 UTC
debian/changelog makes it look like we never pushed these out to lsearchd.
