KAFKA-8547: 2 __consumer_offsets partitions grow very big


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: log cleaner
    • Labels: None
    • Environment: Ubuntu 18.04, kafka_2.12-2.1.1, running as a systemd service

    Description

      It seems the log cleaner doesn't clean old data in __consumer_offsets under that topic's default cleanup policy of compact. This may eventually cause the disk to fill up or the servers to run out of memory.
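      For reference, the policy in effect on the topic can be confirmed with kafka-configs.sh (a sketch; the ZooKeeper address is a placeholder for your environment):

      # Show topic-level config overrides for __consumer_offsets,
      # including cleanup.policy
      ./kafka-configs.sh --zookeeper localhost:2181 \
        --entity-type topics --entity-name __consumer_offsets --describe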

      We observed a few out-of-memory errors on our Kafka servers, and our theory was that they were caused by 2 overly large partitions in __consumer_offsets. On further digging, these 2 large partitions turned out to have segments dating back up to 3 months. Those old files also accounted for most of the data in those partitions (about 10 GB of a partition's 12 GB).
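      A quick way to spot this on a broker (a sketch; the log directory and the partition number are assumptions, check log.dirs in server.properties):

      # Largest __consumer_offsets partitions on this broker
      du -sh /var/lib/kafka/data/__consumer_offsets-* | sort -rh | head

      # Oldest files in a suspect partition; months-old .log files under
      # a compact-only policy match the symptom described above
      ls -lt /var/lib/kafka/data/__consumer_offsets-11/ | tail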

      When we dumped those old segments, we saw:

      1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 00000000161728257775.log --offsets-decoder --print-data-log --deep-iteration
       Dumping 00000000161728257775.log
       Starting offset: 161728257775
       offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746
       offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761
       ...

      It looks like those old segments all contain transactional information. (As a side note, it did take us a while to figure out that for a batch with the control bit set, the key really is endTxnMarker and the value is coordinatorEpoch; in a non-control batch the dump would show key and payload instead. We were wondering whether seeing what those 2 partitions contained in their keys might give us any clues.)

      Our current workaround is based on this comment: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to compact,delete and the partition very quickly dropped to below 2 GB. Not sure if this is something the log cleaner should be able to handle normally? Interestingly, other partitions also contain transactional information, so it's quite curious that these 2 specific partitions could not be cleaned.
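      For concreteness, the workaround amounts to something like the following (a sketch; the ZooKeeper address is a placeholder, and the square brackets are how kafka-configs.sh accepts a config value containing commas). Be aware that with delete in the policy, segments older than the topic's retention can be removed outright:

      # Add delete alongside compact on the offsets topic
      ./kafka-configs.sh --zookeeper localhost:2181 \
        --entity-type topics --entity-name __consumer_offsets \
        --alter --add-config cleanup.policy=[compact,delete]

      # Optionally revert to compact-only once the partitions have shrunk
      ./kafka-configs.sh --zookeeper localhost:2181 \
        --entity-type topics --entity-name __consumer_offsets \
        --alter --add-config cleanup.policy=compact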

      There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917. It just seemed a little outdated/dead, so I opened a new one. Please feel free to merge!


          People

            Assignee: Unassigned
            Reporter: Lerh Chuan Low
            Votes: 0
            Watchers: 3
