Last modified: 2014-10-29 17:08:55 UTC
For the hour 2014-10-20T02:xx:xx, none [1] of the four sources' buckets was marked successful.

What happened?

[1] _________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 10:29:22 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh
  +---------------------+--------+--------+--------+--------+
  | Date                |  bits  | mobile |  text  | upload |
  +---------------------+--------+--------+--------+--------+
  [...]
  | 2014-10-20T00:xx:xx |    .   |    .   |    .   |    .   |
  | 2014-10-20T01:xx:xx |    .   |    .   |    .   |    .   |
  | 2014-10-20T02:xx:xx |    X   |    X   |    X   |    X   |
  | 2014-10-20T03:xx:xx |    .   |    .   |    .   |    .   |
  | 2014-10-20T04:xx:xx |    .   |    .   |    .   |    .   |
  [...]
  +---------------------+--------+--------+--------+--------+

  Statuses:

    . --> Partition is ok
    M --> Partition manually marked ok
    X --> Partition is not ok (duplicates, missing, or nulls)

pass /home/qchris/cluster-scripts/dump_webrequest_status.sh
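For readers who want to pick failing hours out of such a dump automatically, the following is a minimal Python sketch, not part of dump_webrequest_status.sh. It assumes the row layout and the column order (bits, mobile, text, upload) shown in the output above. The function name `failed_partitions` and the `SAMPLE` text are made up for illustration.

```python
import re

# Matches data rows such as "| 2014-10-20T02:xx:xx | X | X | X | X |".
# Header, separator and "[...]" lines do not match and are skipped.
ROW = re.compile(r"^\|\s*(\d{4}-\d{2}-\d{2}T\d{2}:xx:xx)\s*\|(.*)\|\s*$")

# Assumption: columns always appear in this order, as in the dump above.
SOURCES = ("bits", "mobile", "text", "upload")


def failed_partitions(dump):
    """Return {hour: [sources marked X]} for every hour with at least one X."""
    failed = {}
    for line in dump.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        hour, rest = match.groups()
        statuses = [cell.strip() for cell in rest.split("|")]
        # "." and "M" both count as ok; only "X" is reported.
        bad = [src for src, status in zip(SOURCES, statuses) if status == "X"]
        if bad:
            failed[hour] = bad
    return failed


# Hypothetical excerpt in the same format as the dump above.
SAMPLE = """\
| 2014-10-20T01:xx:xx |    .   |    .   |    .   |    .   |
| 2014-10-20T02:xx:xx |    X   |    X   |    X   |    X   |
| 2014-10-20T03:xx:xx |    .   |    .   |    .   |    .   |
"""

print(failed_partitions(SAMPLE))
```

Applied to the excerpt, this reports only the 02:xx:xx hour, with all four sources listed.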
It seems that sometime between 2014-10-20T02:05:00 and 2014-10-20T02:12:00, analytics1021 again got kicked out of its partition leader role. I have now run leader elections, so analytics1021 is ready to help with esams bits this evening.
According to the logs, less than 2 seconds' worth of data got lost between 2014-10-20T02:05:08 and 2014-10-20T02:05:16. It's noteworthy that we again did not see loss for the hosts that we tuned the ACKs for. So I think we should move forward to roll out the ACK experiment to more hosts, so we can get rid of issues when analytics1021 drops out of its leader role again.
(In reply to christian from comment #2)
> So I think we should move forward to roll out the
> ACK experiment to more hosts, so we can get rid of issues when
> analytics1021 drops out of its leader role again.

Patches to roll out the ACK experiment got uploaded to gerrit

  https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:kafka-acks,n,z

(for not yet merged parts) and have been linked to big 69667.
s/big 69667/bug 69667/