Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-11216

StreamingDataflowWorker ReaderCache usage can be incorrect in presence of retries

Details

    • Bug
    • Status: Resolved
    • P2
    • Resolution: Fixed
    • None
    • 2.27.0
    • runner-dataflow
    • None

    Description

      This is similar to BEAM-7547. The issue and identified there was related to the state cache. However for UnboundedSource there is a separate ReaderCache. In particular the following sequence could lead to using a Reader at the wrong position.

      1. Work arrives indicating to use reader at state C1
      2. Work processes and advances reader to C2, commit is prepared with elements output from C1 to C2 and reader checkpoint for C2
      3. Commit of original processing fails (perhaps routed to previous backend during autoscaling)
      4. Work is retried (still indicating to use reader at C1 and still same cache token) however the ReaderCache is used, there is a hit, and the existing reader (positioned at C2) is used.
      5. Retry of work processes and advances reader to C3, commit is scheduled with elements output from C2 to C3 and reader checkpoint for C3
      6. Commit succeeds

      At this point there was never a successful commit for the elements between C1 to C2, though the reader is now advanced past them.

      Possible fixes:
      1. Use increasing work token as in BEAM-7547 to detect retry and not use the ReaderCache entry
      2. Seek recovered reader from cache to checkpoint or only use if it matches the checkpoint. This would probably involve changing the Checkpoint interface so 1 is likely preferred.

      Attachments

        Issue Links

          Activity

            People

              scwhittle Sam Whittle
              scwhittle Sam Whittle
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m