Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-29099

Deadlock for Single Subtask in Kinesis Consumer

    XMLWordPrintableJSON

Details

    • Patch

    Description

      Deadlock is reached as the result of:

      • max lookahead reached for local watermark
      • idle state for subtask

      The lookahead prevents the RecordEmitter from emitting a new record. The idle state prevents the global watermark from being updated.

      To exit this deadlock state, we need to complete the TODO here which updates the global watermark while the subtask is marked idle, which will then allow us to emit a record again as the lookahead is no longer reached.

       

      Context:

      We reached this scenario at Lyft as a result of prolonged CPU throttling on all FlinkKinesisConsumer threads for multiple minutes.

      Walking through the series of events for a single subtask:

      • prolonged CPU throttling occurs and no logs are seen from any FlinkKinesisConsumer thread for up to 15 minutes
      • after CPU throttling the subtask is marked idle
      • the subtask has reached the lookahead for its local watermark relative to the global watermark
      • WatermarkSyncCallback indicates the subtask as idle and does not update the global watermark
      • emitQueue fills to max
      • RecordEmitter cannot emit records due to the max lookahead
      • Deadlock on subtask

      At this point, we had not realized what had happened and processing of all other shards/subtasks had continued for multiple days. When we finally restarted the application, we saw the following behavior:

      • global watermark recalculated after all subtasks consumed data based on the last kinesis record sequence number
      • global watermark moved back in time multiple days, to when the subtask was first marked idle
      • the single subtask processed data while all others remained idle due to the lookahead

      This would have continued until the subtask had caught up to the others and thus the global watermark is within reach of the lookahead for other subtasks.

       

      Repro:

      Too difficult to repro the exact scenario.

      Attachments

        Activity

          People

            sethsaperstein seth saperstein
            sethsaperstein seth saperstein
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 48h
                48h
                Remaining:
                Remaining Estimate - 48h
                48h
                Logged:
                Time Spent - Not Specified
                Not Specified