Last modified: 2014-10-31 12:53:16 UTC
Between 2014-09-28T18:31:10 and 2014-09-28T20:06:34, all esams bits caches saw both duplicate and missing lines. Looking at the Ganglia graphs, it seems we'll see the same issue also for today (2014-09-29). While the issue was going on today, there was a discussion about it on IRC [1]. It is not clear what happened. The theory so far is that, due to recent config changes around varnishkafka, esams bits traffic can no longer be handled with 3 brokers (we're currently using only 3 out of 4 brokers).

[1] Starting at 19:04:03 at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140929.txt
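For readers unfamiliar with what "duplicate and missing lines" means in practice, here is a minimal sketch. It assumes each cache host stamps its log lines with a sequence number that increases by one per line. That is an assumption made for illustration; this report does not say how the duplicates and gaps were detected. The function name `audit_sequence` is hypothetical.

```python
from collections import Counter


def audit_sequence(seqs):
    """Return (duplicates, missing) for one host's sequence numbers.

    Assumes (hypothetically) that the host numbers its lines 1, 2, 3, ...
    so a repeated number is a duplicate and a skipped number is a loss.
    """
    counts = Counter(seqs)
    if not counts:
        return [], []
    # Sequence numbers that were seen more than once.
    duplicates = sorted(s for s, n in counts.items() if n > 1)
    # Sequence numbers between the first and last seen that never showed up.
    lo, hi = min(counts), max(counts)
    missing = [s for s in range(lo, hi + 1) if s not in counts]
    return duplicates, missing


if __name__ == "__main__":
    dups, gaps = audit_sequence([1, 2, 2, 3, 5, 6, 6, 8])
    print("duplicates:", dups)
    print("missing:", gaps)
```

Note that this only finds losses between the first and last sequence number received. Lines lost at the very start or end of a window would need a comparison against the neighbouring windows.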
(In reply to christian from comment #0)
> Looking at the Ganglia graphs, it seems we'll see the same issue also
> for today (2014-09-29).

Yes, we did. The affected period is 2014-09-29T18:41:48--2014-09-29T19:55:21. Again, it affected all esams bits caches, and only those. Again, both duplicate and missing lines. Ottomata restarted varnishkafka on cp3019 at 19:41, and cp3019 immediately recovered: its queues went back to normal and did not become critical again. No more losses on cp3019. This nicely matches yesterday's theory that esams bits traffic spikes are above what 3 brokers can take.
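To put the two affected periods reported so far side by side, the following sketch computes how long each one lasted. The timestamps are copied from the comments above; the names `WINDOWS` and `window_duration` are made up for this example.

```python
from datetime import datetime

# Affected periods as reported in this thread (UTC timestamps).
WINDOWS = [
    ("2014-09-28T18:31:10", "2014-09-28T20:06:34"),
    ("2014-09-29T18:41:48", "2014-09-29T19:55:21"),
]


def window_duration(start, end):
    """Return the length of an affected period as a timedelta."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


durations = [window_duration(s, e) for s, e in WINDOWS]

for (s, e), d in zip(WINDOWS, durations):
    print(s, "->", e, "lasted", d)
```

Both periods lasted between one and two hours (1:35:24 and 1:13:33) and fell in the same part of the evening.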
It happened again for the 5 bits partitions from 2014-10-14T16:xx:xx up to and including 2014-10-14T20:xx:xx. Again only esams bits. Since I was around when it happened, and historic Ganglia graphs don't expose this: the kafka drerrs were not constant. They came in intervals that were ~25 minutes long: within each interval they grew, died off again, and stayed off for the rest of the interval. (See attachment kafka.varnishkafka.kafka_drerr.per_second-2014-10-15.png) All affected caches showed this ~25-minute pattern, but the pattern was not synchronous across machines. While the drerrs showed this pattern, the outbuf_cnt did not. It was high the whole time.
Created attachment 16773: kafka.varnishkafka.kafka_drerr.per_second-2014-10-15.png
It happened again from 2014-10-16T17:xx:xx up to and including 2014-10-16T19:xx:xx.