Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-12406

Progressing watermark for not available Kinesis stream

Details

    • Bug
    • Status: Open
    • P3
    • Resolution: Unresolved
    • 2.27.0
    • None
    • io-java-kinesis
    • None

    Description

      We use Dataflow with Apache Beam to read events from Kinesis streams. Recently, we've spotted that in a case when one of the streams was not available in the middle of events processing (due to removal or problem with the credentials), the data watermark for this stream was still being updated.  

      Imagine scenario:

      1. Permissions allow to read from stream A
      2. Data is read from stream A
      3. Permissions are changed and don’t allow to read from stream A
      4. Watermark for stream A is progressing (but stream data is not read due to permissions issue)
      5. Permissions are fixed to read stream A
      6. Data is read from stream A but from the updated watermark

      As a result, stream data between steps 3-5 is lost and the client doesn’t know that.

      Additionally, it may be confusing from the Dataflow console perspective, as it suggests that events are still being read from the stream. It is hard to rely on the watermark as a source metric for alerting purposes as well.

      Brief investigation suggests that maybe the KinesisReader.getWatermark() logic doesn’t consider the state of the stream i.e. is it available or not, and it treats the removed stream as a stream without traffic. Watermark calculation should be adjusted to take that information into account.

      Attachments

        Activity

          People

            Unassigned Unassigned
            mateuszratajocado Mateusz Rataj
            Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: