Beam / BEAM-5392

GroupByKey on Spark: All values for a single key need to fit in-memory at once

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version: 2.6.0
    • Fix Version: 2.11.0
    • Component: runner-spark

    Description

Currently, when using GroupByKey, all values for a single key need to fit in memory at once.
       
The following issues need to be addressed:
a) We cannot use Spark's groupByKey, because it requires all values for a single key to fit in memory (it is implemented as a "list combiner").
b) ReduceFnRunner iterates over the values multiple times in order to also group by window.
       
      Solution:
       
In the Dataflow Worker code, there are optimized versions of ReduceFnRunner that can take advantage of the elements for a single key being sorted by timestamp.
       
      We can use Spark's `repartitionAndSortWithinPartitions` in order to meet this constraint.
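The issue itself contains no code; the following is a minimal plain-Java sketch of the idea, simulating the semantics of Spark's `repartitionAndSortWithinPartitions` (hash-partition by key, sort each partition by key and timestamp). The class and method names are illustrative only, not Spark or Beam APIs. Once a partition is sorted this way, a ReduceFnRunner-style consumer can process each key's values in a single forward pass with constant state, instead of materializing the whole group in memory:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedGroupSketch {
    // A (key, timestamp, value) element; Element is a hypothetical stand-in
    // for a Beam WindowedValue.
    record Element(String key, long timestamp, String value) {}

    // Simulates repartitionAndSortWithinPartitions: hash-partition by key,
    // then sort each partition by (key, timestamp).
    static List<List<Element>> partitionAndSort(List<Element> input, int numPartitions) {
        List<List<Element>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (Element e : input) {
            partitions.get(Math.floorMod(e.key().hashCode(), numPartitions)).add(e);
        }
        for (List<Element> p : partitions) {
            p.sort(Comparator.comparing(Element::key).thenComparingLong(Element::timestamp));
        }
        return partitions;
    }

    // Single forward pass over a sorted partition: values for one key arrive
    // contiguously and in timestamp order, so each group ("key:count" here)
    // is emitted as soon as the key changes, with O(1) state per key.
    static List<String> reducePass(List<Element> sortedPartition) {
        List<String> results = new ArrayList<>();
        String currentKey = null;
        int count = 0;
        for (Element e : sortedPartition) {
            if (!e.key().equals(currentKey)) {
                if (currentKey != null) results.add(currentKey + ":" + count);
                currentKey = e.key();
                count = 0;
            }
            count++;
        }
        if (currentKey != null) results.add(currentKey + ":" + count);
        return results;
    }

    public static void main(String[] args) {
        List<Element> input = List.of(
            new Element("a", 3, "x"), new Element("b", 1, "y"),
            new Element("a", 1, "z"), new Element("b", 2, "w"));
        for (List<Element> partition : partitionAndSort(input, 2)) {
            System.out.println(reducePass(partition));
        }
    }
}
```

In real Spark code the equivalent would be `JavaPairRDD.repartitionAndSortWithinPartitions` with a composite key comparator; the sketch only demonstrates why the sorted contract removes the need to hold all values for a key at once.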
       
For non-merging windows, we can put the window itself into the key, resulting in smaller groupings.
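For non-merging windows such as fixed windows, the window an element belongs to is fully determined by its timestamp, so it can be folded into the shuffle key. A hypothetical sketch (names and the string encoding are illustrative, not Beam's actual key encoding):

```java
public class WindowedKeySketch {
    static final long WINDOW_SIZE_MS = 60_000; // example: 1-minute fixed windows

    // Folds the (non-merging) window into the key: each (key, windowStart)
    // pair becomes its own, smaller group, instead of one large group per
    // key spanning all windows.
    static String compositeKey(String key, long timestampMs) {
        long windowStart = timestampMs - Math.floorMod(timestampMs, WINDOW_SIZE_MS);
        return key + "@" + windowStart;
    }

    public static void main(String[] args) {
        System.out.println(compositeKey("user1", 65_000));  // user1@60000
        System.out.println(compositeKey("user1", 125_000)); // user1@120000
    }
}
```

This only works because fixed (and other non-merging) windows never change after assignment; merging windows like sessions cannot be put into the key this way, since the final window is not known until all of a key's elements are seen.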
       
This approach has already been tested at ~100 TB input scale on Spark 2.3.x (in a custom Spark runner).
       
      I'll submit a patch once the Dataflow Worker code donation is complete.

People

    Assignee: dmvk (David Morávek)
    Reporter: dmvk (David Morávek)
    Votes: 0
    Watchers: 4


Time Tracking

    Original Estimate: Not Specified
    Remaining Estimate: 0h
    Time Spent: 8h 50m