Last modified: 2014-05-05 22:43:28 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T47266, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 45266 - Incremental indexing (OAI) randomly skips events
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2 (Other open bugs)
Version: unspecified
Hardware/OS: All / All
Importance: Normal priority, major severity
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-02-22 02:18 UTC by Tim Starling
Modified: 2014-05-05 22:43 UTC
CC: 8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Tim Starling 2013-02-22 02:18:38 UTC
Analysis of the lucene-search-2 code indicates a possible explanation for reports of page updates occasionally being missed, causing the previous version of a page to persist in the index indefinitely.

The main loop of IncrementalUpdater fetches OAI records from MediaWiki, 50 pages at a time. It uses the "from" timestamp parameter to advance through the update list: after each batch of pages, it takes the date from the <responseDate> element as the next value of the "from" parameter. This has the following flaws (see the sketch after the list):

* responseDate is the time at which the response is generated. If there is replication lag, the most recent timestamp visible on the chosen slave might be some seconds in the past. Thus, events spanning an interval equal to the replication lag will be skipped.

* responseDate and the "from" parameter have one-second resolution. The English Wikipedia sees about 5 edits per second at peak, so events can appear in the database bearing a timestamp that IncrementalUpdater has already processed, because they were committed later within the same second.

* Using the revision timestamp instead of responseDate would be an improvement. However, rev_timestamp and up_timestamp are generated before the transaction is committed, and it is unknown how long the transaction will take to complete, so the order of rev_timestamp or up_timestamp in the replication log will typically not be monotonic. Additionally, the approach would be highly sensitive to Apache clock skew.
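
To make the failure mode concrete, here is a minimal sketch of the timestamp-driven loop described above. It is not the actual lucene-search-2 code; OaiClient, OaiBatch, and TimestampUpdater are hypothetical names standing in for the real classes.

// A minimal sketch, not the actual lucene-search-2 code.
interface OaiBatch {
    java.util.List<String> records();  // page records in this batch
    String responseDate();             // contents of the <responseDate> element
}

interface OaiClient {
    OaiBatch listRecords(String from); // ListRecords with the "from" parameter
}

class TimestampUpdater {
    void run(OaiClient client, String start) {
        String from = start;           // e.g. "2013-02-22T02:18:38Z"
        while (true) {
            OaiBatch batch = client.listRecords(from);
            // ... update the index from batch.records() ...
            // Flaw: with replication lag, edits committed just before
            // responseDate may not yet be visible on the slave, but the next
            // request starts from responseDate, so they are never fetched.
            // Flaw: one-second resolution means edits committed later within
            // the same second as responseDate are likewise skipped.
            from = batch.responseDate();
        }
    }
}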

The obvious solution is to advance through the update list using the sequence number (resumptionToken) instead of the timestamp, as sketched below.
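
For comparison, here is a sketch of the token-driven loop, redefining the same hypothetical interfaces around resumptionToken; again, this is an illustration, not the merged patch.

// Hypothetical token-driven variant; not the merged patch.
interface TokenBatch {
    java.util.List<String> records();
    String resumptionToken();            // opaque position in the update sequence
}

interface TokenClient {
    TokenBatch listRecords(String from); // initial request only
    TokenBatch resume(String token);     // every subsequent request
}

class TokenUpdater {
    void run(TokenClient client, String start) {
        TokenBatch batch = client.listRecords(start);  // timestamp used once
        while (true) {
            // ... update the index from batch.records() ...
            // The token pins an exact position in the update sequence, so
            // neither replication lag nor same-second commits can drop events.
            batch = client.resume(batch.resumptionToken());
        }
    }
}

Because the token names a position rather than a point in time, the server can return exactly the records that follow it, independent of clock resolution and slave lag.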
Comment 1 Andre Klapper 2013-02-25 19:47:54 UTC
Setting the assignee to Ram and changing the status to ASSIGNED, as he is working on this.
Comment 3 Munagala Ramanath (Ram) 2013-03-12 00:52:24 UTC
Revised versions of the patch, which should be compatible with existing clients and servers:
https://gerrit.wikimedia.org/r/#q,I1c4d2d208146e61ccd15975bb412a423849988c8,n,z
https://gerrit.wikimedia.org/r/#q,Ia8d74f82ecf7d4c5c1b612a39fbcef99bcd10334,n,z

Still undergoing testing.
Comment 4 Andre Klapper 2013-03-25 14:10:02 UTC
Ram: Both of the patches mentioned above have been merged. Is anything more needed here (if yes: what?), or can this bug report be closed as RESOLVED FIXED?
Comment 5 Munagala Ramanath (Ram) 2013-03-25 14:50:13 UTC
Andre: The merged Lucene code has not yet been deployed; we should wait until that happens and no further issues are reported before closing this.
Comment 6 Andre Klapper 2013-05-14 12:21:38 UTC
(In reply to comment #5)
> Andre: The merged Lucene code has not yet been deployed; we should wait until
> that happens and no further issues are reported before closing this.

Does anybody know where I can track this deployment, or when it is scheduled?

Tim?
Comment 7 Andre Klapper 2013-07-03 12:16:39 UTC
(In reply to comment #5)
> Andre: The merged Lucene code has not yet been deployed; we should wait until
> that happens and no further issues are reported before closing this.

Does anybody know where I can track this deployment, or when it is scheduled?

Tim?
Comment 8 Greg Grossmeier 2013-07-03 16:36:28 UTC
Adding Chad and Nik to the cc for their insight.
Comment 9 Nemo 2013-09-30 06:20:06 UTC
Adjusting severity: this is not "normal" but at least major (or even "critical", since in practice we don't rebuild the index, so search data may be "lost" for years). However, if more work is needed, I doubt it will happen now that the focus is on CirrusSearch.
Comment 10 Nik Everett 2013-09-30 12:35:00 UTC
Do we plan to deploy this? I only did a quick review of it, but at that level it looks sane.
Comment 11 Chad H. 2014-02-13 06:20:53 UTC
The OAI code went out long ago. I honestly can't remember whether we ever pushed out a fixed lsearchd.
Comment 12 Chad H. 2014-02-20 22:25:19 UTC
debian/changelog makes it look like we never pushed these out to lsearchd.
