KAFKA-13855

FileNotFoundException: Error while rolling log segment for topic partition in dir


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.1
    • Fix Version/s: None
    • Component/s: log
    • Labels: None

    Description

      Hello,

      We faced an issue where one of the Kafka brokers in the cluster failed with an exception and restarted:

       

      [2022-04-13T09:51:44,563][ERROR][category=kafka.server.LogDirFailureChannel] Error while rolling log segment for prod_data_topic-7 in dir /var/opt/kafka/data/1
      java.io.FileNotFoundException: /var/opt/kafka/data/1/prod_data_topic-7/00000000000026872377.index (No such file or directory)
      	at java.base/java.io.RandomAccessFile.open0(Native Method)
      	at java.base/java.io.RandomAccessFile.open(Unknown Source)
      	at java.base/java.io.RandomAccessFile.<init>(Unknown Source)
      	at java.base/java.io.RandomAccessFile.<init>(Unknown Source)
      	at kafka.log.AbstractIndex.$anonfun$resize$1(AbstractIndex.scala:183)
      	at kafka.log.AbstractIndex.resize(AbstractIndex.scala:176)
      	at kafka.log.AbstractIndex.$anonfun$trimToValidSize$1(AbstractIndex.scala:242)
      	at kafka.log.AbstractIndex.trimToValidSize(AbstractIndex.scala:242)
      	at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:508)
      	at kafka.log.Log.$anonfun$roll$8(Log.scala:1916)
      	at kafka.log.Log.$anonfun$roll$2(Log.scala:1916)
      	at kafka.log.Log.roll(Log.scala:2349)
      	at kafka.log.Log.maybeRoll(Log.scala:1865)
      	at kafka.log.Log.$anonfun$append$2(Log.scala:1169)
      	at kafka.log.Log.append(Log.scala:2349)
      	at kafka.log.Log.appendAsLeader(Log.scala:1019)
      	at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:984)
      	at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:972)
      	at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$4(ReplicaManager.scala:883)
      	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:273)
      	at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
      	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
      	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
      	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
      	at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
      	at scala.collection.TraversableLike.map(TraversableLike.scala:273)
      	at scala.collection.TraversableLike.map$(TraversableLike.scala:266)
      	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
      	at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:871)
      	at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:571)
      	at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:605)
      	at kafka.server.KafkaApis.handle(KafkaApis.scala:132)
      	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:70)
      	at java.base/java.lang.Thread.run(Unknown Source)
      
      [2022-04-13T09:51:44,812][ERROR][category=kafka.log.LogManager] Shutdown broker because all log dirs in /var/opt/kafka/data/1 have failed 
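For context on the stack trace above: per the trace, `AbstractIndex.resize` reopens the index file via the `RandomAccessFile` constructor in `"rw"` mode. That mode can create a missing *file*, but not a missing *parent directory*, so if the partition directory disappears underneath the broker, the reopen during a segment roll fails with exactly this `FileNotFoundException`. A minimal standalone sketch of that mechanism (the temp-dir paths and file names are illustrative, not the broker's real layout):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexReopenDemo {
    public static void main(String[] args) throws IOException {
        // Simulate a partition directory containing a pre-allocated index file.
        Path partitionDir = Files.createTempDirectory("prod_data_topic-7");
        File index = new File(partitionDir.toFile(), "00000000000026872377.index");
        try (RandomAccessFile raf = new RandomAccessFile(index, "rw")) {
            raf.setLength(10 * 1024); // pre-sized, as Kafka does for offset indexes
        }

        // Something outside the broker removes the directory (external cleanup,
        // a filesystem fault, an unmount, ...) -- the suspected trigger here.
        Files.delete(index.toPath());
        Files.delete(partitionDir);

        // Reopening in "rw" mode, as resize() does, now fails: "rw" cannot
        // recreate the missing parent directory, matching the stack trace.
        try {
            new RandomAccessFile(index, "rw");
            System.out.println("unexpected: open succeeded");
        } catch (FileNotFoundException e) {
            System.out.println("FileNotFoundException: " + e.getMessage());
        }
    }
}
```

This only demonstrates why the exception looks the way it does; it does not establish *what* removed the directory, which is the open question in this report.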

      There is no additional useful information in the logs, only one warning just before the error:

      [2022-04-13T09:51:44,720][WARN][category=kafka.server.ReplicaManager] [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions __consumer_offsets-22,prod_data_topic-5,__consumer_offsets-30,
      ....
      prod_data_topic-0 and stopped moving logs for partitions  because they are in the failed log directory /var/opt/kafka/data/1.
      
      [2022-04-13T09:51:44,720][WARN][category=kafka.log.LogManager] Stopping serving logs in dir /var/opt/kafka/data/1

      The topic configuration is:

      /opt/kafka $ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic prod_data_topic
      Topic: prod_data_topic        PartitionCount: 12      ReplicationFactor: 3    Configs: min.insync.replicas=2,segment.bytes=1073741824,max.message.bytes=15728640,retention.bytes=4294967296
              Topic: prod_data_topic        Partition: 0    Leader: 3       Replicas: 3,1,2 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 1    Leader: 1       Replicas: 1,2,3 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 2    Leader: 2       Replicas: 2,3,1 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 3    Leader: 3       Replicas: 3,2,1 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 4    Leader: 1       Replicas: 1,3,2 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 5    Leader: 2       Replicas: 2,1,3 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 6    Leader: 3       Replicas: 3,2,1 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 7    Leader: 1       Replicas: 1,3,2 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 8    Leader: 2       Replicas: 2,1,3 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 9    Leader: 3       Replicas: 3,1,2 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 10   Leader: 1       Replicas: 1,2,3 Isr: 3,2,1
              Topic: prod_data_topic        Partition: 11   Leader: 2       Replicas: 2,3,1 Isr: 3,2,1 

      Previously (a day before this happened) we changed the "retention.bytes" broker config to 5368709120 (the previous value was 6442450944), but we are not sure this is related. The current custom broker config is:

       

      log.retention.check.interval.ms=300000
      log.segment.bytes=1073741824
      log.retention.bytes=4294967296
      log.retention.hours=40
      
      
      message.max.bytes=15728640
      replica.lag.time.max.ms=30000
      min.insync.replicas=2
      delete.topic.enable=true
      replica.fetch.max.bytes=15728640
      default.replication.factor=3
      num.replica.fetchers=2 
      
      

       

      Could you please help us investigate what the reason for this failure could be? We have no ideas ourselves: there was no topic cleanup, file removal, or other maintenance procedure on the disk.

      People

        Assignee: Unassigned
        Reporter: Sergey Ivanov (mrMigles)
        Votes: 0
        Watchers: 3