Last modified: 2014-11-03 15:55:13 UTC
Since 2014-10-28 ~20:00, the Kafka logs on each of the four brokers are drowning [1] in exceptions around processing fetch requests, like the following one:

[2014-10-31 13:20:01,425] 7574685637 [kafka-request-handler-1] ERROR kafka.server.KafkaApis - [KafkaApi-22] Error when processing fetch request for partition [webrequest_text,8] offset 25962495797 from consumer with correlation id 2
kafka.common.OffsetOutOfRangeException: Request for offset 25962495797 but we only have log segments in the range 24383057344 to 25960520433.
        at kafka.log.Log.read(Log.scala:380)
        at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSet(KafkaApis.scala:530)
        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:476)
        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:471)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
        at scala.collection.immutable.Map$Map3.foreach(Map.scala:163)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
        at scala.collection.immutable.Map$Map3.map(Map.scala:143)
        at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSets(KafkaApis.scala:471)
        at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:437)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:186)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
        at java.lang.Thread.run(Thread.java:744)

The exceptions are strongly biased towards the first second of a minute. [2] This happens every minute, with some 60 extra exceptions every 5th minute. [3]

[1] qchris@analytics1012:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c
       6 [2014-08-21
      18 [2014-08-22
       9 [2014-08-28
      12 [2014-09-09
       3 [2014-09-16
      47 [2014-09-17
      28 [2014-09-18
       6 [2014-10-08
      30 [2014-10-16
      12 [2014-10-23
    1563 [2014-10-28
    8640 [2014-10-29
    8610 [2014-10-30
     834 [2014-10-31

qchris@analytics1018:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c
       6 [2014-08-12
       6 [2014-08-21
      18 [2014-08-22
       9 [2014-08-28
      12 [2014-09-09
       3 [2014-09-16
      45 [2014-09-17
      31 [2014-09-18
       6 [2014-10-08
      30 [2014-10-16
      12 [2014-10-23
    1527 [2014-10-28
    8640 [2014-10-29
    8610 [2014-10-30
     834 [2014-10-31

qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -f 1 -d ' ' | uniq -c
       4 [2014-06-24
       2 [2014-07-02
       1 [2014-08-04
       4 [2014-08-12
       4 [2014-08-21
      10 [2014-08-22
       6 [2014-08-28
      10 [2014-09-09
       2 [2014-09-16
      32 [2014-09-17
      21 [2014-09-18
       6 [2014-10-08
      30 [2014-10-16
      12 [2014-10-23
    1380 [2014-10-28
    8640 [2014-10-29
    8610 [2014-10-30
     834 [2014-10-31

qchris@analytics1022:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c
       9 [2014-09-09
       3 [2014-09-16
      34 [2014-09-17
      20 [2014-09-18
       6 [2014-10-08
      38 [2014-10-16
      10 [2014-10-23
     508 [2014-10-28
    2890 [2014-10-29
    2877 [2014-10-30
     280 [2014-10-31

[2] qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -c 1-60 | grep 2014-10 | cut -f 3 -d : | cut -f 1 -d , | sort | uniq -c
      12 00
   17466 01
    1916 02
      82 03
       6 04
       6 06
       6 07
       6 23
       6 42
       6 46
       6 52
       6 56

[3] qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -c 1-60 | grep 2014-10 | cut -f 2 -d : | sort | uniq -c
     380 00
     314 01
     314 02
     314 03
     314 04
     368 05
     314 06
     312 07
     312 08
     312 09
     366 10
     306 11
     306 12
     318 13
     318 14
     378 15
     318 16
     318 17
     318 18
     318 19
     372 20
     312 21
     318 22
     318 23
     318 24
     378 25
     312 26
     312 27
     312 28
     312 29
     372 30
     312 31
     312 32
     312 33
     312 34
     372 35
     312 36
     312 37
     312 38
     312 39
     366 40
     312 41
     312 42
     312 43
     312 44
     372 45
     318 46
     312 47
     312 48
     312 49
     366 50
     312 51
     318 52
     312 53
     312 54
     366 55
     312 56
     312 57
     320 58
     332 59
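At its core, the OffsetOutOfRangeException in the log excerpt is a simple range check on the broker: a fetch for an offset outside the retained segment range is rejected. A minimal sketch of that check (illustrative names, not Kafka's actual implementation):

```python
# Illustrative sketch of the broker-side range check behind
# OffsetOutOfRangeException; names are assumptions, not Kafka's actual code.
class OffsetOutOfRangeError(Exception):
    pass

def read_log(requested_offset, segment_start, segment_end):
    """Raise if the requested offset is outside the retained segment range."""
    if not (segment_start <= requested_offset <= segment_end):
        raise OffsetOutOfRangeError(
            "Request for offset %d but we only have log segments in the "
            "range %d to %d." % (requested_offset, segment_start, segment_end))
    # A real broker would now return the message set starting at the offset.
    return requested_offset
```

With the numbers from the excerpt, the requested offset 25962495797 lies past the newest retained offset 25960520433, so every fetch fails; a consumer that keeps retrying with the same stale offset produces exactly this flood of errors.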
The time 2014-10-28 ~20:00 nicely matches the leader re-election of bug 72550 comment 1.
Ah, this is caused by Ellery's use of kafkacat on stat1002. It looks like kafkacat does not properly create the offset directory if it doesn't already exist. Instead, it creates a single file and stores some number in it. That number is the offset being requested in these error messages. I've advised Ellery to create the offset directory before running kafkacat. I will also report this as a bug to Magnus on kafkacat's github home.
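The advised workaround boils down to creating the offset directory up front, so kafkacat never gets the chance to put a plain file in its place. A hedged sketch; the path is a hypothetical placeholder, not the actual setup on stat1002:

```shell
# Hypothetical workaround: ensure the offset store directory exists before
# kafkacat runs. The path below is an illustrative placeholder.
OFFSET_DIR="${HOME}/kafkacat-offsets"
mkdir -p "$OFFSET_DIR"   # idempotent: no error if the directory already exists
# ... then invoke kafkacat as before (invocation omitted here).
```

`mkdir -p` is safe to repeat on every run, so it can simply be prepended to whatever cron job or script launches kafkacat.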
Yay! Logs are much cleaner now again. Thanks!
(In reply to Andrew Otto from comment #2)
> I will also report this as a bug to Magnus on kafkacat's github home.

It seems the ticket is at: https://github.com/edenhill/kafkacat/issues/9
(In reply to Andrew Otto from comment #2)
> Ah, this is caused by Ellery's use of kafkacat on stat1002.

Yikes! kafkacat not properly creating offset directories is ... well, a bug. kafkacat is fresh, so I figure these kinds of issues are expected. No worries. And the logs are clean right now.

However, I think this bug shows three other important things that should get discussed before closing it:

* kafkacat can render our Kafka (not kafkacat) logs unusable. That came somewhat as a surprise to me. Do we expect people to use kafkacat more, so we need to guard against noise in the logs, or will kafkacat's use die out (except for debugging), with people using ETL-ed data or kafkatee instead?

* kafkacat did not recover. Even if kafkacat does not properly survive the leader re-election, I had expected it to recover at some point. But in the end, it did not. This item can be explained away by noting that kafkacat was run in a cron job, and hence kept restarting again and again with the faulty offsets.

* How come it happened right after the leader election? The timing matches too well to be a simple coincidence. Did the use of kafkacat really start right at the time we re-elected? Do we know how kafkacat (or kafkatee) survives re-elections?

Hence, re-opening, as I think these three items need more discussion.
> * will kafkacat's use die out

I don't think we will officially support kafkacat for regular users. kafkatee is much better and more useful; kafkacat is really nice for debugging. And if it doesn't work, we can always use the console consumer that ships with Kafka. Ellery was just using kafkacat because kafkatee hadn't been available. kafkacat isn't even officially installed anywhere; he was just using a statically compiled binary in my homedir on stat1002.

> * How come it happened right after the leader election?

I am likely to be wrong, but I would think it is a coincidence. I had been corresponding with Ellery last week about how he can use kafkacat, and I think he started to use it around the same time. You said the logs started to appear around 10 minutes or so after the election, right? Again, I have not investigated this, and I could be wrong.

> Hence, re-opening, as I think these three items need more discussion.

I think we can close this. Unless we intend to vet and support kafkacat in production, we don't need to track its issues here.
Thanks for the follow-up. I guess you're right about closing, if we don't intend to support kafkacat.