Last modified: 2014-10-31 13:42:34 UTC
analytics1021 again got kicked out of its Kafka partition leader role on 2014-10-27 at ~07:12. I am not running leader re-elections for now, as ottomata wanted to run some further experiments if this happened to analytics1021 again.
I ran a leader re-election. Analytics1021 is leader for a few partitions again. (Still pending on check whether leader re-election caused loss/duplicates)
This bug is still missing the number of messages lost when analytics1021 lost its partition leader role. For the text cluster, it only affected amssq34, amssq53.esams.wikimedia.org, amssq56.esams.wikimedia.org, and cp4008.ulsfo.wmnet. The affected period was 2014-10-27T07:12:29/2014-10-27T07:12:32, and in total 100 messages got lost, which is <<1 second's worth of data for text. For the upload cluster, it affected all caches in that cluster except for cp4015. The affected period was 2014-10-27T07:12:29/2014-10-27T07:12:46, and in total ~51K messages got lost, which is <2 seconds' worth of data for upload. When analytics1021 lost its partition leader role, bits, mobile, and text already had the ACK fix; upload did not. So the lost messages on upload are expected, as is the absence of loss on bits and mobile. However, I had expected to see no loss on text either, as it already had the ACK fix. It's strange to see exactly 100 lost messages on text. 100 is a suspiciously nice number.
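For anyone re-checking the figures above, here is a minimal sketch of the arithmetic in Python. The helper names (window_seconds, min_rate) are hypothetical and not part of any tooling mentioned in this bug; the upload count is the rounded ~51K from the comment, not an exact value.

```python
from datetime import datetime


def window_seconds(interval: str) -> float:
    """Length in seconds of an ISO 8601 'start/end' interval."""
    start, end = (datetime.fromisoformat(part) for part in interval.split("/"))
    return (end - start).total_seconds()


def min_rate(lost: int, max_seconds_worth: float) -> float:
    """Smallest per-second message rate consistent with the claim that
    `lost` messages are less than `max_seconds_worth` seconds of data."""
    return lost / max_seconds_worth


# Affected periods as quoted in the comment.
text_window = window_seconds("2014-10-27T07:12:29/2014-10-27T07:12:32")
upload_window = window_seconds("2014-10-27T07:12:29/2014-10-27T07:12:46")

# Lost-message counts as quoted; 51_000 is the rounded "~51K" (assumption).
text_min_rate = min_rate(100, 1)
upload_min_rate = min_rate(51_000, 2)

print(f"text:   {text_window:.0f}s window, implied rate > {text_min_rate:.0f} msg/s")
print(f"upload: {upload_window:.0f}s window, implied rate > {upload_min_rate:.0f} msg/s")
```

The implied rates are only lower bounds derived from the "<<1 second" and "<2 seconds" statements; they say nothing about how the loss was distributed within each window.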
(In reply to christian from comment #1) > (Still pending on check whether leader re-election caused loss/duplicates) Bug 72679 has details on that.