  1. Beam
  2. BEAM-3499

Watch can make no progress if a single poll takes more than checkpoint interval

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: sdk-java-core

    Description

      For example, when Watch is used to poll a filepattern that matches hundreds of thousands of files, a single poll may take more than 10 seconds, which is the default checkpoint interval in OutputAndTimeBoundedSplittableProcessElementInvoker. As a result, the tracker (GrowthTracker) gets checkpointed before anything is added to it, i.e. before https://github.com/apache/beam/blob/0d918b7cab8c4ccb2b5e050501327912161d40a7/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L727, at a moment when it contains no useful information, so the residual checkpoint state is as empty as the initial one. When we resume from the residual checkpoint, the situation simply repeats until we get lucky: either a poll takes less than 10 seconds, or no checkpoint is requested for more than 10 seconds (e.g. because the checkpointing thread isn't scheduled).
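
      The loop described above can be illustrated with a toy model. This is only a sketch and not Beam code: the class and method names are hypothetical, and it assumes the simplification that an attempt makes progress only if its poll finishes before the checkpoint interval elapses.

```java
/** Toy model of the livelock described above. Hypothetical names; not Beam code. */
final class CheckpointLivelockDemo {

    /**
     * Returns the 1-based attempt on which poll results are finally recorded,
     * or -1 if no attempt gets that far.
     *
     * pollMillis[i] is how long the poll takes on attempt i. The checkpoint timer
     * is assumed to start when the attempt starts, so a poll that is still running
     * when the interval elapses leaves the tracker empty, and the residual state
     * is identical to the initial state.
     */
    static int firstAttemptWithProgress(long[] pollMillis, long checkpointIntervalMillis) {
        for (int i = 0; i < pollMillis.length; i++) {
            boolean checkpointedBeforePollFinished = pollMillis[i] >= checkpointIntervalMillis;
            if (!checkpointedBeforePollFinished) {
                return i + 1;
            }
            // Residual == initial state: the next attempt starts from scratch.
        }
        return -1;
    }

    public static void main(String[] args) {
        long intervalMillis = 10_000; // the 10-second default mentioned above
        // Every poll is slower than the interval: no attempt ever records anything.
        System.out.println(firstAttemptWithProgress(new long[] {12_000, 15_000, 11_000}, intervalMillis));
        // The second poll happens to be fast enough, so progress is made on attempt 2.
        System.out.println(firstAttemptWithProgress(new long[] {12_000, 9_000}, intervalMillis));
    }
}
```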

      One possible fix is to change the splittable DoFn (SDF) checkpointing strategy so that it guarantees progress: for example, start counting time from the moment the first block is claimed, or allow the tracker to refuse checkpointing if nothing has been claimed yet, or something similar.
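
      As a rough illustration of the first option (start counting time from the first claim), a checkpoint timer could look something like the sketch below. The class and its methods are hypothetical, and this is not necessarily how the issue was actually resolved in Beam.

```java
/**
 * Sketch of a checkpoint timer that starts counting only after the first claim.
 * Hypothetical class; not part of the Beam SDK.
 */
final class FirstClaimCheckpointTimer {
    private final long intervalMillis;
    private long firstClaimAtMillis = -1; // -1 means nothing has been claimed yet

    FirstClaimCheckpointTimer(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Records a successful claim; only the first claim starts the timer. */
    void onClaim(long nowMillis) {
        if (firstClaimAtMillis < 0) {
            firstClaimAtMillis = nowMillis;
        }
    }

    /**
     * Returns true only if something has been claimed and the interval has
     * elapsed since that first claim. Before any claim, a checkpoint is never
     * requested, however long the poll takes.
     */
    boolean shouldCheckpoint(long nowMillis) {
        return firstClaimAtMillis >= 0 && nowMillis - firstClaimAtMillis >= intervalMillis;
    }
}
```

      With a timer like this, a slow first poll can no longer be interrupted before its results are recorded, so every attempt claims at least one block.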

       

      A workaround for users of Watch (primarily via FileIO.match().continuously()) is to shard the filepattern into a set of finer-grained filepatterns that each match fewer files, so that each match call takes less than 10 seconds.
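
      One way to produce such finer-grained filepatterns is to split a directory glob by the leading character of the file name. The helper below is a hypothetical sketch; the bucket path in the usage comment is made up, and the set of leading characters has to cover every file name that can occur, otherwise files are silently missed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for sharding one broad filepattern into several narrower ones. */
final class FilepatternSharder {

    /**
     * Turns directoryPrefix + "*" into one pattern per leading character,
     * e.g. ("gs://my-bucket/logs/", "ab") gives
     * ["gs://my-bucket/logs/a*", "gs://my-bucket/logs/b*"].
     */
    static List<String> shardByLeadingChar(String directoryPrefix, String leadingChars) {
        List<String> patterns = new ArrayList<>();
        for (char c : leadingChars.toCharArray()) {
            patterns.add(directoryPrefix + c + "*");
        }
        return patterns;
    }
}
```

      The resulting patterns could then be fed into the pipeline as separate elements, for example via Create.of(patterns) followed by FileIO.matchAll().continuously(...), so that each pattern is polled on its own.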

            People

              Assignee: jkff Eugene Kirpichov
              Reporter: jkff Eugene Kirpichov

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h