  1. Kafka
  2. KAFKA-12256

auto commit causes delays due to retriable UNKNOWN_TOPIC_OR_PARTITION


Details

    Description

      In KAFKA-6829 a change was made to the consumer to internally retry commits upon receiving UNKNOWN_TOPIC_OR_PARTITION.

      Though this helped mitigate issues around stale broker metadata, there were some valid concerns around the negative effects for routine topic deletion:

      https://github.com/apache/kafka/pull/4948

      In particular, if a commit is issued for a deleted topic, retries can block the consumer for up to max.poll.interval.ms. This is tunable of course, but any amount of stalling in a consumer can lead to unnecessary lag.
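
      The blocking effect can be illustrated with a small simulation. This is only a sketch and not the consumer's actual code: `RetryUntilDeadline` and its members are hypothetical names, time is simulated by counting backoff waits only (request latency is ignored), and the values used in the usage note are the documented defaults of 100 ms for retry.backoff.ms and 300000 ms for max.poll.interval.ms.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of a deadline-bounded retry loop; not Kafka's implementation.
public final class RetryUntilDeadline {

    /** Outcome of one bounded retry run. */
    public static final class Result {
        public final boolean succeeded;
        public final int attempts;
        public final long elapsedMs;

        Result(boolean succeeded, int attempts, long elapsedMs) {
            this.succeeded = succeeded;
            this.attempts = attempts;
            this.elapsedMs = elapsedMs;
        }
    }

    /**
     * Calls attempt until it returns true or the simulated clock reaches timeoutMs.
     * Each failed attempt advances the simulated clock by backoffMs.
     */
    public static Result run(BooleanSupplier attempt, long timeoutMs, long backoffMs) {
        long elapsed = 0;
        int attempts = 0;
        while (elapsed < timeoutMs) {
            attempts++;
            if (attempt.getAsBoolean()) {
                return new Result(true, attempts, elapsed);
            }
            // A retriable error: wait out the backoff (capped at the deadline) and try again.
            elapsed = Math.min(timeoutMs, elapsed + backoffMs);
        }
        return new Result(false, attempts, elapsed);
    }
}
```

      Calling `RetryUntilDeadline.run(() -> false, 300_000L, 100L)` models a commit for a topic that never comes back: every attempt fails with a retriable error, so the loop only returns once the full 300000 ms have been consumed.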

      One of the assumptions while permitting the change was that in practice it should be rare for commits to occur for deleted topics, since that would imply messages were being read or published at the time of deletion. It's fair to expect users to not delete topics that are actively published to. But this assumption is false in cases where auto commit is enabled.

      With the current implementation of auto commit, the consumer will regularly issue commits for all topics being fetched from, regardless of whether or not messages were actually received. The fetch positions are simply flushed, even when they are 0. This is simple and generally efficient, though it does mean commits are often redundant. Besides the auto commit interval, commits are also issued at the time of rebalance, which is often precisely at the time topics are deleted.

      This means that in practice commits for deleted topics are not really rare. This is particularly an issue when the consumer is subscribed to a multitude of topics using a wildcard. For example, a consumer might subscribe to a particular "flavor" of topic with the aim of auditing all such data, and these topics might dynamically come and go. The consumer's metadata and rebalance mechanisms are meant to handle this gracefully, but the end result is that such groups are often blocked in a commit for several seconds or minutes (up to max.poll.interval.ms, which defaults to 5 minutes) whenever a delete occurs. This can sometimes result in significant lag.

      Besides having users abandon auto commit in the face of topic deletes, there are probably multiple ways to deal with this. These include reconsidering whether commits still truly need to be retried here, or whether this behavior should be more configurable, e.g. through a separate commit timeout or policy. In some cases the loss of a commit and the subsequent message duplication is still preferable to processing delays. And setting an artificially low max.poll.interval.ms or rebalance.timeout.ms comes with its own set of concerns.
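
      As a rough sketch of the "abandon auto commit" workaround, an application could disable enable.auto.commit and issue its own commits with an explicit bound, accepting a lost commit rather than a long stall. `ManualCommitSketch`, `consumerProps` and `tryCommit` are hypothetical names introduced for illustration. The commit call is passed in as a function so the block compiles without the Kafka client on the classpath, and catching every RuntimeException is a simplification: real code would more likely catch only org.apache.kafka.common.errors.TimeoutException.

```java
import java.time.Duration;
import java.util.Properties;
import java.util.function.Consumer;

// Hypothetical helper; a sketch of the workaround, not a recommended implementation.
public final class ManualCommitSketch {

    /** Consumer settings with auto commit turned off, so the application decides when to commit. */
    public static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        props.setProperty("enable.auto.commit", "false");
        return props;
    }

    /**
     * Runs a commit call with an explicit timeout and reports whether it completed.
     * A failed commit is dropped here, which trades possible duplicates for no long stall.
     */
    public static boolean tryCommit(Consumer<Duration> commitCall, Duration timeout) {
        try {
            commitCall.accept(timeout);
            return true;
        } catch (RuntimeException e) {
            // Simplification: treats any runtime failure as "commit not done".
            return false;
        }
    }
}
```

      With a `KafkaConsumer` instance named `consumer` (an assumed variable), the call would look like `ManualCommitSketch.tryCommit(consumer::commitSync, Duration.ofSeconds(2))`, which should resolve to the existing `commitSync(Duration)` overload. The two-second bound is an arbitrary illustration, not a recommended value.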

      At the very least, the current behavior and the pitfalls of deleting topics with active consumers should be documented.


            People

              Assignee: Unassigned
              Reporter: Ryan Leslie (rleslie)
              Votes: 1
              Watchers: 3
