Last modified: 2014-10-29 14:21:19 UTC
Every few days, analytics1021 gets kicked out of its partition leader role. The last few occurrences are:
2014-08-06: bug 69244
2014-08-13: bug 69665
2014-08-16: bug 69666
Thanks for filing this bug, Christian. IMHO, this has been a fairly regular problem with this host, and attempts to reimage/fix it have not been successful. Andrew/Gage -- can you update?
The core of this issue is a timeout of the Zookeeper connection, which neither Gage nor I have been able to solve.

Quick summary: Kafka brokers need to maintain a live connection with Zookeeper in order to remain in the ISR. Brokers set a timeout. If a broker can't talk to Zookeeper within this timeout, it will close the connection it has and attempt to open a new one, most likely to a different Zookeeper host than it had before. Zookeeper notices when the broker closes this connection, and then tells all of the other brokers that this broker has left the ISR. Its leadership for any partitions is then demoted. It usually takes this broker less than a second to reconnect to another Zookeeper host and rejoin the cluster.

Thus far, when this happens, Gage or I have logged in and manually started a preferred replica election. This brings the offending broker's leadership status back to normal.

This would only be an annoyance, if it weren't for the small varnishkafka hiccups it causes. Ideally, when a broker loses partition leadership, producers would be notified of the metadata change quickly enough that their buffers don't fill up. We have noticed that for some higher volume partitions (upload and/or bits), some varnishkafkas drop messages during the short time that leadership metadata is being propagated.

Action items:
- Solve the zookeeper timeout. Not sure how to replicate or fix this right now.
- Keep varnishkafka from dropping messages on metadata change. There is likely some tuning we can do to make sure we don't drop messages when partition leadership changes.
- Investigate auto partition balancing. This is a new feature in Kafka 0.8.1 and would eliminate the manual step of starting a preferred replica election. It won't solve either of the above problems, but it would allow the cluster to rebalance itself without manual intervention when this happens.

See also: https://rt.wikimedia.org/Ticket/Display.html?id=6877
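For reference, a minimal sketch of the settings and the manual step involved. The property names are from upstream Kafka 0.8; the values, file paths, and zookeeper host names here are illustrative assumptions, not copied from our puppetized configs:

    # server.properties (Kafka broker) -- values illustrative
    zookeeper.connect=ZK_HOST1:2181,ZK_HOST2:2181,ZK_HOST3:2181/kafka/eqiad
    zookeeper.session.timeout.ms=16000
    # Kafka 0.8.1+ auto leader rebalancing (third action item above):
    auto.leader.rebalance.enable=true

    # manual step we currently run after a broker drops out:
    kafka-preferred-replica-election.sh --zookeeper ZK_HOST1:2181/kafka/eqiad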
Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds. Leader election can take up to 3 seconds. Hopefully this will solve the second bullet point above. We'll have to wait until the next time a leader election happens and see if we miss any data from it.
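For the record, roughly what that change looks like in the varnishkafka config (the kafka.* prefix hands the property through to librdkafka; the exact file path is an assumption):

    # /etc/varnishkafka/varnishkafka.conf (path assumed)
    # was: kafka.queue.buffering.max.ms = 1000
    kafka.queue.buffering.max.ms = 5000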
It happened again on:
2014-08-24 ~16:00 (with recovery for the next reading in ganglia. Since ganglia shows a decrease of volume for that time, it might align with the kafkatee changes on analytics1003. So it might actually be an artifact.)
2014-08-30 ~03:30 (with recovery ~06:00)
2014-09-02 ~10:50 (not yet recovered)

(I haven't yet checked if those instances also caused packetloss)
(In reply to christian from comment #4)
> 2014-08-24 ~16:00 (with recovery for the next reading in

Wrong day. It should read

2014-08-25 ~16:00 (with recovery for the next reading in

Sorry.
(In reply to Andrew Otto from comment #3)
> Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds.
> Leader election can take up to 3 seconds. Hopefully this will solve the
> second bullet point above. We'll have to wait until the next time a leader
> election happens and see if we miss any data from it.

The re-elections from
2014-08-30 ~03:30
2014-09-02 ~10:50
(see comment #5) happened after the buffer increase, but both still exhibit missing sequence numbers. So bumping the buffer size did not do the trick :-(
I have done some more sleuthing on the zookeeper timeouts today, and I want to record my thoughts.

- The timeout can happen when the broker is connected to any zookeeper.

- Timeouts happen more often than timeout induced leader changes. That is, a Zookeeper session timeout does not necessarily mean produce errors.

- When there are timeout induced leader changes, they seem to be caused by the zookeeper leader expiring a zookeeper session, rather than by a broker reconnecting to zookeeper because of a timeout. That is, both the zookeeper server and the kafka broker (zookeeper client) set a timeout of 16 seconds. (I believe they negotiate down to the lower timeout if the settings don't match.) If kafka notices the timeout first, it will just close the connection and reconnect with the same session id. If the timeout happens to be noticed by zookeeper before kafka, then zookeeper expires the session. Kafka's own timeout will trigger after the session has been expired by zookeeper, and when it attempts to reconnect, it will be told it has an expired session, which forces it to reconnect a second time to ask for a new session.

- This also seems related to timeouts occurring when the broker (usually analytics1021) is also the controller for the kafka cluster. (The controller is the one in charge of intra-cluster things like ISRs and leaders... I think.) As far as I can tell, all of the timeout induced leader changes that we see are also accompanied by the following error messages:

  (on the broker)
  kafka.controller.KafkaController$SessionExpirationListener - [SessionExpirationListener on 21], ZK expired; shut down all controller components and try to re-elect

  (on the zookeeper leader)
  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x46fd72a6d4243f type:create cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/kafka/eqiad/controller Error:KeeperErrorCode = NodeExists for /kafka/eqiad/controller

  That is, when the timed-out broker finally is able to reconnect (with a new session id), it attempts to re-register its previous controller status with zookeeper, only to find that another broker has taken over as controller.

I just went back to my email thread to the Kafka Users group and noticed that someone had replied[1] to my sleuthing about broker GCs back in July! I never saw this email response! Gah! I will look into this suggestion tomorrow.

[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201407.mbox/%3CCAFbh0Q2f71qgs5JDNFxkm7SSdZyYMH=ZpEOxotuEQfKqeXQHfw@mail.gmail.com%3E
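To make the negotiation point above concrete, these are the two knobs involved; the values shown are defaults/illustrative and not copied from our configs:

    # kafka broker (server.properties): the session timeout it requests
    zookeeper.session.timeout.ms=16000

    # zookeeper server (zoo.cfg): bounds within which the requested
    # timeout is negotiated
    tickTime=2000
    minSessionTimeout=4000      # defaults to 2*tickTime if unset
    maxSessionTimeout=40000     # defaults to 20*tickTime if unset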
Adding some more records:

To rule out temporary basic network problems, I started a script to ping all zookeepers from analytics1021 in an endless loop. That produced results already: there was no output for ~10 seconds between 15:26:04 and 15:26:14, but RTT times were low before and afterwards.

Since we hardly have data to compare against, I collected the analytics1021 ganglia graphs that showed an exceptional increase/decrease during that period, and added the relevant parts of the kafka log and kafka gc log. (See attachment analytics1021-2014-09-04T15-26-03.tar.gz)

It seems most disks spiked. The kafka log looks harmless. The kafka gc log looks harmless too.

--------------------------------------------

Since both the pings and kafka gc stopped during that time, it might be that it's not network related, but the machine itself preempting processes and blocking them for some time. So I started another monitor that (in an endless loop) logs the timestamp and then waits a second. That already produced relevant results too: there was no output for ~13 seconds between 20:47:18 and 20:47:31.

Again most disks spiked. The kafka log looks harmless. The kafka gc log looks harmless too.

--------------------------------------------

Since I lack access to the system's logs: do they show anything interesting around 15:26:04 -- 15:26:14, or 20:47:18 -- 20:47:31?
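For reproducibility, roughly what the two monitors look like; the hostname and log paths are placeholders, not the actual commands I ran:

    # ping monitor (one such loop per zookeeper host):
    while true; do date +%FT%T; ping -c 1 -W 1 ZOOKEEPER_HOST; done >> /tmp/zk-ping.log 2>&1

    # "heartbeat" monitor: a gap between consecutive timestamps means the
    # process was blocked for roughly that long
    while true; do date +%FT%T; sleep 1; done >> /tmp/tick.log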
Created attachment 16376: analytics1021-2014-09-04T15-26-03.tar.gz
Trying to understand why it only happens on analytics1021 and not the other brokers, the only difference that I could find is that analytics1021 is the only broker that does not have $ganglia_aggregator set to true. I am not sure if that is relevant, but seeing the quoted email from comment #7... could the $ganglia_aggregator setting change the machine's characteristics enough to warrant analytics1021 exhibiting this strange behaviour?
Yeah, strange indeed that this only happens on analytics1021. I *think* we have seen this elsewhere before, but not often. And, I think not since the cluster reinstall in July.

> root@palladium:~# salt -E 'analytics10(12|18|21|22).eqiad.wmnet' cmd.run 'grep -i "timed out" /var/log/kafka/kafka.log | wc -l'
> analytics1012.eqiad.wmnet:
>     0
> analytics1018.eqiad.wmnet:
>     0
> analytics1021.eqiad.wmnet:
>     80
> analytics1022.eqiad.wmnet:
>     4
This happened again on 2014-09-06 around 19:43:22.

Here are some relevant sar captures: https://gist.github.com/ottomata/ae372649afc914a6f606

Something is happening with paging stats, but it is unclear if it is a symptom or a cause. Starting around 19:43:11, pages are being freed up really fast. Then, at 19:43:22, activity drops out, as if it is blocked by something. At 19:43:33, a ton more pages are freed all at once. Then it seems to go back to normal.

Seeing as the actual block starts at around :22, and there is weird behavior before that, maybe paging is part of the cause.
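For anyone wanting to pull the same view out of sysstat on the host, roughly the invocation (the saved-file path differs per distro and is an assumption here):

    # paging statistics (pgfree/s etc.) around the stall on 2014-09-06
    sar -B -f /var/log/sysstat/sa06 -s 19:43:00 -e 19:44:00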
Change 163744 had a related patch set uploaded by QChris: Force ACKs from all in-sync kafka replicas https://gerrit.wikimedia.org/r/163744
Happened again on 2014-10-13 around 13:37:31.
Hi Christian - would you say that the fix did not solve this problem?

Thanks,
-Toby
(In reply to Toby Negrin from comment #15)
> would you say that the fix did not solve this problem?

If you are referring to the change from comment 13, it totally did the trick. For the machines it addresses, the issue is gone. Those machines no longer lose data.

While the fix helps with the issue of losing data, it does not yet prevent analytics1021 from dropping out.
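For context, these ACK changes map onto the librdkafka request.required.acks property that varnishkafka passes through. A sketch (the exact key/prefix used in varnishkafka.conf is an assumption on my part):

    # the change from comment 13: wait for all in-sync replicas before acking
    kafka.topic.request.required.acks = -1
    # the "Require 2 ACKs" follow-ups below: wait for two brokers
    # (leader plus one replica)
    kafka.topic.request.required.acks = 2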
Change 167550 had a related patch set uploaded by QChris: Require 2 ACKs from kafka brokers for mobile caches https://gerrit.wikimedia.org/r/167550
Change 167551 had a related patch set uploaded by QChris: Require 2 ACKs from kafka brokers for text caches https://gerrit.wikimedia.org/r/167551
Change 167552 had a related patch set uploaded by QChris: Require 2 ACKs from kafka brokers for bits caches https://gerrit.wikimedia.org/r/167552
Change 167553 had a related patch set uploaded by QChris: Require 2 ACKs from kafka brokers per default https://gerrit.wikimedia.org/r/167553
Change 167550 merged by Ottomata: Require 2 ACKs from kafka brokers for mobile caches https://gerrit.wikimedia.org/r/167550
Set vm.dirty_writeback_centisecs = 200 (was 500)
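For the record, a sketch of how that would be applied (runtime change plus a persistent sysctl.d entry; the file name is an assumption):

    # apply immediately
    sysctl -w vm.dirty_writeback_centisecs=200
    # persist across reboots
    echo 'vm.dirty_writeback_centisecs = 200' > /etc/sysctl.d/60-dirty-writeback.conf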
Change 167551 merged by Ottomata: Require 2 ACKs from kafka brokers for text caches https://gerrit.wikimedia.org/r/167551
Change 167552 merged by Ottomata: Require 2 ACKs from kafka brokers for bits caches https://gerrit.wikimedia.org/r/167552
Change 167553 merged by Ottomata: Require 2 ACKs from kafka brokers per default https://gerrit.wikimedia.org/r/167553