  1. Kafka
  2. KAFKA-14440

Local state wipeout with EOS

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.2.3
    • Fix Version/s: None
    • Component/s: streams
    • Labels: None

    Description

      Hey,

      I have a Kafka Streams service, running in a k8s cluster, that aggregates events from multiple input topics. The topology has multiple foreign-key joins (FKJs). The input topics held around 7 billion events when the service was started from `earliest`.

      The service has EOS enabled, with:

      transaction.timeout.ms: 600000
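
      For readers who want to reproduce the setup, the configuration described above could look roughly like the sketch below. Only `transaction.timeout.ms: 600000` and the fact that EOS is enabled come from this report. The `application.id` and `bootstrap.servers` values are hypothetical, and the `exactly_once_v2` value is an assumption, since the report does not say which EOS mode is in use. Plain string keys are used so that the sketch compiles without the Kafka jars.

```java
import java.util.Properties;

public class EosConfigSketch {

    // Builds Streams properties matching the report, as plain key/value strings.
    public static Properties buildConfig() {
        Properties props = new Properties();

        // Hypothetical placeholders, not taken from the report.
        props.setProperty("application.id", "aggregator-service");
        props.setProperty("bootstrap.servers", "kafka:9092");

        // Assumption: the report only says "EOS enabled"; the exact mode is unknown.
        props.setProperty("processing.guarantee", "exactly_once_v2");

        // From the report: a 10-minute transaction timeout.
        props.setProperty("transaction.timeout.ms", "600000");

        return props;
    }

    public static void main(String[] args) {
        // Print the resulting configuration, one key=value pair per line.
        buildConfig().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

      In a real service these properties would be passed to the `KafkaStreams` constructor together with the topology.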

      The problem I am having is frequent local state wipe-outs, which lead to very long rebalances. As can be seen from the attached images, local disk usage drops to ~0 very often. These wipe-outs are part of the EOS guarantee, based on this log message:

      State store transfer-store did not find checkpoint offsets while stores are not empty, since under EOS it has the risk of getting uncommitted data in stores we have to treat it as a task corruption error and wipe out the local state of task 1_8 before re-bootstrapping

      I noticed that this happens as a result of one of the following:

      • The process gets a SIGKILL when it runs out of memory or fails to shut down gracefully, for example on pod rotation. This explains the missing local checkpoint file, but I thought local checkpoint updates were frequent, so I expected only part of the local state to be reset, not all of it.
      • Although we have a long transaction timeout configured, the following appears many times in the logs, after which Kafka Streams goes into an error state. On startup, the local checkpoint file is not found:
      Transiting to abortable error state due to org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
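
      To confirm which tasks are affected before a restart, one option is to scan the local state directory for task directories that have no checkpoint file. The sketch below is a minimal diagnostic, assuming the layout `<state.dir>/<application.id>/<taskId>/.checkpoint` and task directory names of the form `1_8`, as in the log message above. The class and method names are hypothetical and not part of Kafka.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CheckpointScanSketch {

    // Returns the names of task directories (e.g. "1_8") under appStateDir
    // that do not contain a ".checkpoint" file, sorted by name.
    public static List<String> tasksMissingCheckpoint(Path appStateDir) throws IOException {
        try (Stream<Path> children = Files.list(appStateDir)) {
            return children
                    .filter(Files::isDirectory)
                    // Assumption: task directories are named <subtopology>_<partition>.
                    .filter(dir -> dir.getFileName().toString().matches("\\d+_\\d+"))
                    // Assumption: the checkpoint file is named ".checkpoint".
                    .filter(dir -> !Files.exists(dir.resolve(".checkpoint")))
                    .map(dir -> dir.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CheckpointScanSketch <state.dir>/<application.id>
        Path appStateDir = Paths.get(args[0]);
        for (String task : tasksMissingCheckpoint(appStateDir)) {
            System.out.println("no checkpoint file for task " + task);
        }
    }
}
```

      Running this against a pod's state volume after a SIGKILL would show whether every task lost its checkpoint file or only some of them.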

      The service has 10 instances, all showing the same behaviour. The issue disappears when EOS is disabled.

      The Kafka cluster runs Kafka 2.6, with a minimum ISR of 3.

      Attachments

        1. Screenshot 2022-12-02 at 09.26.27.png
          438 kB
          Abdullah alkhawatrah

            People

              Assignee: Unassigned
              Reporter: Abdullah alkhawatrah (akhawatrah)