Last modified: 2014-11-10 15:59:12 UTC
The timestamp reported by varnish is taken when the request arrives. The sequence number reported by varnish is taken when the response is sent. So when requests Foo and Bar arrive at the same cache, and Foo arrives slightly before Bar, but Bar get its response sooner, it may occur that the requests get logged as: +--------------+----------+-----------------+ | request name | Time | Sequence number | +--------------+----------+-----------------+ | ... | 09:59:59 | 4708 | | ... | 09:59:59 | 4709 | | Foo | 09:59:59 | 4711 | | Bar | 10:00:00 | 4710 | | ... | 10:00:00 | 4712 | | ... | 10:00:00 | 4713 | +--------------+----------+-----------------+ Since we partition by timestamp, the partition for the 9th hour would look like +--------------+----------+-----------------+ | request name | Time | Sequence number | +--------------+----------+-----------------+ | ... | 09:59:59 | 4708 | | ... | 09:59:59 | 4709 | | Foo | 09:59:59 | 4711 | +--------------+----------+-----------------+ (hence appearing to miss sequence number 4710), and the partition for the 10th hour would look like: +--------------+----------+-----------------+ | request name | Time | Sequence number | +--------------+----------+-----------------+ | Bar | 10:00:00 | 4710 | | ... | 10:00:00 | 4712 | | ... | 10:00:00 | 4713 | +--------------+----------+-----------------+ (hence appearing to miss sequence number 4711). So both partitions look like they'd be missing lines when being looked at in isolation, and our per partition monitoring flags them both as faulty. But when looking at both partitions combined, no line is actually missing, and the monitoring could flag them as ok. In the past 2 weeks, we had two such occasions (one for bits, one for upload). The manual fix is simple: Generate the _SUCCESS file by hand for both partitions. Let's improve our monitoring to be aware of such races and check for them automatically. (One way would be that if the naive validation fails, Oozie's <error to="..."> would no longer go to "kill", but to a follow-up step that check specifically for this race.)
nice find!
Just to help us keep track how relevant this is, it again happened on 2014-08-17T20:xx:xx/2014-08-17T21:xx:xx (on upload) 2014-08-18T20:xx:xx/2014-08-18T21:xx:xx (on upload)
Change 155578 had a related patch set uploaded by QChris: DO NOT MERGE HiveQL file to check for sequence stats across hours https://gerrit.wikimedia.org/r/155578
Change 155578 abandoned by QChris: DO NOT MERGE HiveQL file to check for sequence stats across hours Reason: Just parked code to aid manual checking across hours https://gerrit.wikimedia.org/r/155578
Happened again for: 2014-08-23T16:xx:xx/2014-08-23T17:xx:xx (on upload)
Happened again for: 2014-08-31T15:xx:xx/2014-08-31T16:xx:xx (on upload) 2014-09-02T17:xx:xx/2014-09-02T18:xx:xx (on upload)
Happened again for: 2014-09-04T22:xx:xx/2014-09-04T23:xx:xx (on upload)
Happened again for: 2014-09-28T14:xx:xx/2014-09-28T15:xx:xx (on upload)
Happened again for: 2014-10-09T17:xx:xx/2014-10-09T18:xx:xx (on upload)
This looks like something we should resource. Christian, how does this line up with the other things that you are working on in severity?
In terms of severity and other work, it does not line up too well. And since we're suffering more severe issues, it has not been fixed yet. But since this race-condition a distraction nonetheless, it's one of the things I meant when bringing up “work on cluster infrastructure” during the discussion of goals for this quarter. So, for me, it's also well within scope, and the fix will happen once the more pressing things are done.
ok -- makes sense. Thanks Christian.
Happened again for: 2014-10-21T14/2H (on upload)
Happened again for: 2014-10-22T06/2H (on bits)
Happened again for: 2014-10-24T19/2H (on bits)
Happened again for: 2014-10-29T01/2H (on upload)
Happened again for: 2014-10-29T08/2H (on bits)
Happened again for: 2014-11-05T14/2H (on bits)
Happened again for: 2014-11-08T00/2H (on bits)
Happened again for: 2014-11-08T12/2H (on text) <-- first time on text
Happened again for: 2014-11-09T21/2H (on upload)
Happened again for: 2014-11-10T10/2H (on upload)