Last modified: 2014-11-03 15:55:13 UTC
Since 2014-10-28 ~20:00, the Kafka logs on each of the four brokers are drowning [1] in exceptions around processing fetch requests, like the following one:

[2014-10-31 13:20:01,425] 7574685637 [kafka-request-handler-1] ERROR kafka.server.KafkaApis - [KafkaApi-22] Error when processing fetch request for partition [webrequest_text,8] offset 25962495797 from consumer with correlation id 2
kafka.common.OffsetOutOfRangeException: Request for offset 25962495797 but we only have log segments in the range 24383057344 to 25960520433.
        at kafka.log.Log.read(Log.scala:380)
        at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSet(KafkaApis.scala:530)
        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:476)
        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:471)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
        at scala.collection.immutable.Map$Map3.foreach(Map.scala:163)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
        at scala.collection.immutable.Map$Map3.map(Map.scala:143)
        at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSets(KafkaApis.scala:471)
        at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:437)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:186)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
        at java.lang.Thread.run(Thread.java:744)

The exceptions are strongly biased towards the first second of a minute. [2] This happens every minute, with some 60 extra exceptions every 5th minute. [3]

[1] qchris@analytics1012:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c
       6 [2014-08-21
      18 [2014-08-22
       9 [2014-08-28
      12 [2014-09-09
       3 [2014-09-16
      47 [2014-09-17
      28 [2014-09-18
       6 [2014-10-08
      30 [2014-10-16
      12 [2014-10-23
    1563 [2014-10-28
    8640 [2014-10-29
    8610 [2014-10-30
     834 [2014-10-31

qchris@analytics1018:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c
       6 [2014-08-12
       6 [2014-08-21
      18 [2014-08-22
       9 [2014-08-28
      12 [2014-09-09
       3 [2014-09-16
      45 [2014-09-17
      31 [2014-09-18
       6 [2014-10-08
      30 [2014-10-16
      12 [2014-10-23
    1527 [2014-10-28
    8640 [2014-10-29
    8610 [2014-10-30
     834 [2014-10-31

qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -f 1 -d ' ' | uniq -c
       4 [2014-06-24
       2 [2014-07-02
       1 [2014-08-04
       4 [2014-08-12
       4 [2014-08-21
      10 [2014-08-22
       6 [2014-08-28
      10 [2014-09-09
       2 [2014-09-16
      32 [2014-09-17
      21 [2014-09-18
       6 [2014-10-08
      30 [2014-10-16
      12 [2014-10-23
    1380 [2014-10-28
    8640 [2014-10-29
    8610 [2014-10-30
     834 [2014-10-31

qchris@analytics1022:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c
       9 [2014-09-09
       3 [2014-09-16
      34 [2014-09-17
      20 [2014-09-18
       6 [2014-10-08
      38 [2014-10-16
      10 [2014-10-23
     508 [2014-10-28
    2890 [2014-10-29
    2877 [2014-10-30
     280 [2014-10-31

[2] qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -c 1-60 | grep 2014-10 | cut -f 3 -d : | cut -f 1 -d , | sort | uniq -c
      12 00
   17466 01
    1916 02
      82 03
       6 04
       6 06
       6 07
       6 23
       6 42
       6 46
       6 52
       6 56

[3] qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -c 1-60 | grep 2014-10 | cut -f 2 -d : | sort | uniq -c
     380 00
     314 01
     314 02
     314 03
     314 04
     368 05
     314 06
     312 07
     312 08
     312 09
     366 10
     306 11
     306 12
     318 13
     318 14
     378 15
     318 16
     318 17
     318 18
     318 19
     372 20
     312 21
     318 22
     318 23
     318 24
     378 25
     312 26
     312 27
     312 28
     312 29
     372 30
     312 31
     312 32
     312 33
     312 34
     372 35
     312 36
     312 37
     312 38
     312 39
     366 40
     312 41
     312 42
     312 43
     312 44
     372 45
     318 46
     312 47
     312 48
     312 49
     366 50
     312 51
     318 52
     312 53
     312 54
     366 55
     312 56
     312 57
     320 58
     332 59
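At its core, the OffsetOutOfRangeException in the log excerpt is a simple range check on the broker: a fetch for an offset outside the retained segment range is rejected. A minimal sketch of that check (illustrative names, not Kafka's actual implementation):

```python
# Illustrative sketch of the broker-side range check behind
# OffsetOutOfRangeException; names are assumptions, not Kafka's actual code.
class OffsetOutOfRangeError(Exception):
    pass

def read_log(requested_offset, segment_start, segment_end):
    """Raise if the requested offset is outside the retained segment range."""
    if not (segment_start <= requested_offset <= segment_end):
        raise OffsetOutOfRangeError(
            "Request for offset %d but we only have log segments in the "
            "range %d to %d." % (requested_offset, segment_start, segment_end))
    # A real broker would now return the message set starting at the offset.
    return requested_offset
```

With the numbers from the excerpt, the requested offset 25962495797 lies past the newest retained offset 25960520433, so every fetch fails; a consumer that keeps retrying with the same stale offset produces exactly this flood of errors.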
The time 2014-10-28 ~20:00 nicely matches the leader re-election of bug 72550 comment 1.
Ah, this is caused by Ellery's use of kafkacat on stat1002. It looks like kafkacat does not properly create the offset directory if it doesn't already exist. Instead, it creates a single file and stores some number in it. That number is the offset being requested in these error messages. I've advised Ellery to create the offset directory before running kafkacat. I will also report this as a bug to Magnus on kafkacat's github home.
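The advised workaround boils down to creating the offset directory up front, so kafkacat never gets the chance to put a plain file in its place. A hedged sketch; the path is a hypothetical placeholder, not the actual setup on stat1002:

```shell
# Hypothetical workaround: ensure the offset store directory exists before
# kafkacat runs. The path below is an illustrative placeholder.
OFFSET_DIR="${HOME}/kafkacat-offsets"
mkdir -p "$OFFSET_DIR"   # idempotent: no error if the directory already exists
# ... then invoke kafkacat as before (invocation omitted here).
```

`mkdir -p` is safe to repeat on every run, so it can simply be prepended to whatever cron job or script launches kafkacat.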
Yay! Logs are much cleaner now again. Thanks!
(In reply to Andrew Otto from comment #2)
> I will also report this as a bug to Magnus on kafkacat's github home.

It seems the ticket is at: https://github.com/edenhill/kafkacat/issues/9
(In reply to Andrew Otto from comment #2)
> Ah, this is caused by Ellery's use of kafkacat on stat1002.

Yikes! kafkacat not properly creating offset directories is ... well, a bug. kafkacat is fresh, so I figure these kinds of issues are expected. No worries. And the logs are clean right now.

However, I think this bug shows three other important things that should get discussed before closing it:

* kafkacat can render our Kafka (not kafkacat) logs unusable. That came somewhat as a surprise to me. Do we expect people to use kafkacat more, so we need to guard against noise in the logs, or will kafkacat's use die out (except for debugging), with people using ETL-ed data or kafkatee instead?

* kafkacat did not recover. Even if kafkacat does not properly survive the leader re-election, I had expected it to recover at some point. But in the end, it did not. This item can be explained away by noting that kafkacat was run in a cron job, and hence kept restarting again and again with the faulty offsets.

* How come it happened right after the leader election? The timing matches too well to be a simple coincidence. Did the use of kafkacat really start right at the time we re-elected? Do we know how kafkacat (or kafkatee) survives re-elections?

Hence, re-opening, as I think these three items need more discussion.
> * will kafkacat's use die out

I don't think we will officially support kafkacat for regular users. kafkatee is much better and more useful; kafkacat is really nice for debugging. And if it doesn't work, we can always use the console consumer that ships with Kafka. Ellery was just using kafkacat because kafkatee hadn't been available. kafkacat isn't even officially installed anywhere; he was just using a statically compiled binary in my homedir on stat1002.

> * How come it happened right after the leader election?

I am likely to be wrong, but I would think it is a coincidence. I had been corresponding with Ellery last week about how he can use kafkacat, and I think he started to use it around the same time. You said the logs started to appear around 10 minutes or so after the election, right? Again, I have not investigated this, and I could be wrong.

> Hence, re-opening, as I think these three items need more discussion.

I think we can close this. Unless we intend to vet and support kafkacat in production, we don't need to track its issues here.
Thanks for the follow-up. I guess you're right about closing, if we don't intend to support kafkacat.