Last modified: 2014-10-31 12:52:30 UTC
Between 2014-10-13T13:xx:xx and 2014-10-13T22:xx:xx, several partitions
were not marked successful [1].

It seems bits was most affected, followed by upload and, to a lesser
extent, text and mobile.

What happened?


[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 11:07:47 // exit code: 0
cwd: ~
cluster-scripts/dump_webrequest_status.sh
  +---------------------+--------+--------+--------+--------+
  |        Date         |  bits  |  text  | mobile | upload |
  +---------------------+--------+--------+--------+--------+
[...]
  | 2014-10-13T11:xx:xx |   .    |   .    |   .    |   .    |
  | 2014-10-13T12:xx:xx |   .    |   .    |   .    |   .    |
  | 2014-10-13T13:xx:xx |   X    |   X    |   X    |   X    |
  | 2014-10-13T14:xx:xx |   .    |   .    |   .    |   .    |
  | 2014-10-13T15:xx:xx |   X    |   .    |   .    |   .    |
  | 2014-10-13T16:xx:xx |   X    |   .    |   .    |   .    |
  | 2014-10-13T17:xx:xx |   X    |   .    |   .    |   .    |
  | 2014-10-13T18:xx:xx |   X    |   .    |   .    |   .    |
  | 2014-10-13T19:xx:xx |   X    |   .    |   .    |   X    |
  | 2014-10-13T20:xx:xx |   X    |   .    |   .    |   X    |
  | 2014-10-13T21:xx:xx |   X    |   .    |   .    |   X    |
  | 2014-10-13T22:xx:xx |   .    |   .    |   .    |   .    |
  | 2014-10-13T23:xx:xx |   .    |   .    |   .    |   .    |
[...]
  +---------------------+--------+--------+--------+--------+


Statuses:

  . --> Partition is ok
  X --> Partition is not ok (duplicates, missing, or nulls)

pass cluster-scripts/dump_webrequest_status.sh
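To back up the "bits was most affected" reading, the failed hours per cache type can be tallied directly from the table. The following is only a minimal sketch: the rows are copied by hand from the output quoted in this report, and the function name count_failed_hours is hypothetical (it is not part of dump_webrequest_status.sh or any cluster script).

```python
from collections import Counter

# Rows copied by hand from the status table quoted in this report.
# "." means the partition is ok, "X" means it is not ok.
STATUS_TABLE = """\
| Date                | bits | text | mobile | upload |
| 2014-10-13T11:xx:xx | .    | .    | .      | .      |
| 2014-10-13T12:xx:xx | .    | .    | .      | .      |
| 2014-10-13T13:xx:xx | X    | X    | X      | X      |
| 2014-10-13T14:xx:xx | .    | .    | .      | .      |
| 2014-10-13T15:xx:xx | X    | .    | .      | .      |
| 2014-10-13T16:xx:xx | X    | .    | .      | .      |
| 2014-10-13T17:xx:xx | X    | .    | .      | .      |
| 2014-10-13T18:xx:xx | X    | .    | .      | .      |
| 2014-10-13T19:xx:xx | X    | .    | .      | X      |
| 2014-10-13T20:xx:xx | X    | .    | .      | X      |
| 2014-10-13T21:xx:xx | X    | .    | .      | X      |
| 2014-10-13T22:xx:xx | .    | .    | .      | .      |
| 2014-10-13T23:xx:xx | .    | .    | .      | .      |
"""


def split_row(line: str) -> list[str]:
    """Split one '| a | b |' table row into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def count_failed_hours(table: str) -> Counter:
    """Count, per cache type, the hourly partitions marked 'X' (hypothetical helper)."""
    rows = [line for line in table.splitlines() if line.strip().startswith("|")]
    header = split_row(rows[0])
    failed = Counter()
    for line in rows[1:]:
        cells = split_row(line)
        # Skip the first column (the date); pair each status with its cache type.
        for cache_type, status in zip(header[1:], cells[1:]):
            if status == "X":
                failed[cache_type] += 1
    return failed


if __name__ == "__main__":
    for cache_type, hours in count_failed_hours(STATUS_TABLE).most_common():
        print(f"{cache_type}: {hours} failed hour(s)")
```

For the rows quoted here, this gives 8 failed hours for bits, 4 for upload, and 1 each for text and mobile, which matches the ordering described above.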
For 2014-10-13T13:xx:xx it affected all caches with the only exception of

  cp1056.eqiad.wmnet (bits)
  cp1057.eqiad.wmnet (bits)
  cp3019.esams.wikimedia.org (bits)
  cp3020.esams.wikimedia.org (bits)

(which are exactly the machines that saw the ACK experiments [1], and
we did not see missing log lines for any of them.)

For that hour, we saw no duplicates, but intermittent loss between
2014-10-13T13:37:15 and 2014-10-13T13:38:16 which is worth

  bits    <1 second
  text    <2 seconds
  mobile  <2 seconds
  upload  <1 second

This nicely matches the dropout of analytics1021 from its partition
leader role [2].

I marked the 2014-10-13T13:xx:xx partitions as ok.

[1] https://git.wikimedia.org/blob/operations%2Fpuppet.git/ccc17ce0780f6c56ddcac4f4dcd9f90b2dc0d346/manifests%2Frole%2Fcache.pp#L510
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=69667#c14
The failed partitions from 2014-10-13T15:xx:xx through 2014-10-13T21:xx:xx involved esams caches exclusively. Hence, filing under the esams bug.
(Since it is also about analytics1021 dropping out of its leader role, also blocking on bug 69667.)