  1. Beam
  2. BEAM-7745

StreamingSideInputDoFnRunner/StreamingSideInputFetcher have suboptimal state access pattern during normal operation

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.33.0
    • Component/s: runner-dataflow
    • Labels: None

    Description

      I spent some time tracking down sources of uncached state fetches in my job, and one large category was the interaction of StreamingSideInputDoFnRunner + StreamingSideInputFetcher.

      During normal operation, when the main input is NOT blocked by the side input, the side input fetcher performs an uncached state read for every input element. Changing it to cache the blockedMap state gave my job a ~30-40% increase in throughput.

      The interaction is a little complicated, and there are a couple of optimizations I can see here.

      Primarily, the blockedMap is only persisted if it is non-empty. Because the WindmillStateCache won't cache a null value, the "nothing is blocked" signal is never actually cached, and a state read is issued to Windmill for every input element. The solution seems to be to persist an empty map rather than null when there are no blocked elements.
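
      The effect described above can be illustrated with a small, self-contained sketch. This is not the Beam or Windmill code: NullSkippingCache, backendReadsFor, and the "blockedMap" key are hypothetical stand-ins, and the only behavior assumed from the description is that the cache never stores a null value.

```java
import java.util.HashMap;
import java.util.Map;

public class BlockedMapCacheSketch {

  /** Hypothetical stand-in for a state cache that refuses to cache null values. */
  static final class NullSkippingCache {
    private final Map<String, Map<String, Integer>> cached = new HashMap<>();
    private final Map<String, Map<String, Integer>> backend = new HashMap<>();
    int backendReads = 0;

    Map<String, Integer> read(String key) {
      Map<String, Integer> hit = cached.get(key);
      if (hit != null) {
        return hit; // cache hit, no backend round trip
      }
      backendReads++; // cache miss, simulated read from the state backend
      Map<String, Integer> value = backend.get(key);
      if (value != null) {
        cached.put(key, value); // only non-null values are ever cached
      }
      return value;
    }

    void write(String key, Map<String, Integer> value) {
      if (value == null) {
        // Persisting "nothing" clears the state, so there is nothing to cache.
        backend.remove(key);
        cached.remove(key);
      } else {
        backend.put(key, value);
        cached.put(key, value);
      }
    }
  }

  /** Counts simulated backend reads when no element is ever blocked. */
  static int backendReadsFor(boolean persistEmptyMap, int elements) {
    NullSkippingCache state = new NullSkippingCache();
    for (int i = 0; i < elements; i++) {
      Map<String, Integer> blocked = state.read("blockedMap");
      boolean nothingBlocked = blocked == null || blocked.isEmpty();
      if (nothingBlocked) {
        // The element is processed immediately; then the (empty) blocked state is persisted.
        state.write("blockedMap", persistEmptyMap ? new HashMap<>() : null);
      }
    }
    return state.backendReads;
  }

  public static void main(String[] args) {
    System.out.println("persist null:      " + backendReadsFor(false, 1000));
    System.out.println("persist empty map: " + backendReadsFor(true, 1000));
  }
}
```

      In this toy model, persisting null costs one backend read per element (1000 reads for 1000 elements), while persisting an empty map costs a single read, after which every lookup is a cache hit.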

       

            People

              Assignee: Steve Niemitz
              Reporter: Steve Niemitz

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m