Beam / BEAM-8848

Flink: Memory efficient GBK implementation for batch runner

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.23.0
    • Component/s: runner-flink

    Description

      In the current batch runner, all values for a single key must fit in memory, because the resulting GBK iterable is materialized using a "List" data structure.

      Implications:

      • This blocks users from using custom sharding in most of the IOs, because the whole shard needs to fit in memory.
      • Frequent OOM failures on skewed data (the pipeline should run slowly instead of failing). This is very hard to debug for an inexperienced user.

      We can do much better for non-merging windows, the same way we do for the Spark runner. The only drawback is that this implementation does not support re-iterating the result. We'll support turning this implementation on and off, in case the user needs to trade re-iteration for memory efficiency.
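
      The trade-off described above can be illustrated with a minimal sketch. It is not the Flink runner's code: the class GbkIterableSketch and the helpers materialized, singlePass and sum are hypothetical names chosen for this example. The first helper copies all values for a key into a List (re-iterable, memory grows with the number of values); the second wraps the underlying iterator directly (constant extra memory, but a second iteration is rejected).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Beam.
public class GbkIterableSketch {

  /** Copies every value into a List: re-iterable, but all values are held in memory. */
  public static <T> Iterable<T> materialized(Iterator<T> values) {
    List<T> buffer = new ArrayList<>();
    values.forEachRemaining(buffer::add);
    return buffer;
  }

  /** Wraps the iterator without buffering: values stream through once, then are gone. */
  public static <T> Iterable<T> singlePass(Iterator<T> values) {
    return new Iterable<T>() {
      private boolean consumed = false;

      @Override
      public Iterator<T> iterator() {
        if (consumed) {
          // Assumed behavior for this sketch: fail loudly on re-iteration.
          throw new IllegalStateException("GBK result is not re-iterable");
        }
        consumed = true;
        return values;
      }
    };
  }

  /** Example consumer that needs only one pass over the grouped values. */
  public static long sum(Iterable<Integer> values) {
    long total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }

  public static void main(String[] args) {
    Iterable<Integer> buffered = materialized(Arrays.asList(1, 2, 3).iterator());
    System.out.println(sum(buffered)); // first pass
    System.out.println(sum(buffered)); // second pass works, at the cost of buffering

    Iterable<Integer> streamed = singlePass(Arrays.asList(1, 2, 3).iterator());
    System.out.println(sum(streamed)); // first pass
    try {
      sum(streamed); // second pass is rejected
    } catch (IllegalStateException e) {
      System.out.println("re-iteration rejected: " + e.getMessage());
    }
  }
}
```

      A user function that iterates over the grouped values only once works with either variant; one that needs a second pass would have to keep the materialized behavior or buffer the values itself.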

            People

              Assignee: dmvk David Morávek
              Reporter: dmvk David Morávek
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m