Beam / BEAM-5987

Cache side inputs on Spark runner for performance

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.9.0
    • Component/s: runner-spark
    • Labels: None

    Description

      We profiled a Spark job and found that 90% of the application time was spent deserializing side inputs.

      For the Spark runner, an easy fix is to cache materialized side inputs per bundle, so each side input is deserialized once per bundle rather than on every access. This reduced the running time of the profiled job from 3 hours to 30 minutes.
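
      To illustrate the idea of caching materialized side inputs per bundle, here is a minimal sketch. It is not the actual change made in the Spark runner: the class BundleSideInputCache, its Key type, the viewId/window strings, and the finishBundle() hook are all hypothetical names chosen for this example, and the decoder function stands in for whatever coder the runner uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical per-bundle cache of deserialized side inputs (not a Beam API). */
final class BundleSideInputCache {

  /** Cache key: which side input (view) is read, and for which window. */
  static final class Key {
    final String viewId;
    final String window;

    Key(String viewId, String window) {
      this.viewId = viewId;
      this.window = window;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return viewId.equals(other.viewId) && window.equals(other.window);
    }

    @Override
    public int hashCode() {
      return Objects.hash(viewId, window);
    }
  }

  private final Map<Key, Object> materialized = new HashMap<>();
  private int deserializations = 0;

  /**
   * Returns the side input value for (viewId, window). The decoder runs only
   * on the first access for a given key; later accesses reuse the cached value.
   */
  @SuppressWarnings("unchecked")
  <T> T get(String viewId, String window, byte[] encoded, Function<byte[], T> decoder) {
    return (T) materialized.computeIfAbsent(
        new Key(viewId, window),
        k -> {
          deserializations++; // counts how often the expensive step actually ran
          return decoder.apply(encoded);
        });
  }

  /** Number of times the decoder was invoked; exposed only for illustration. */
  int deserializations() {
    return deserializations;
  }

  /** Drops all cached values so they do not outlive the bundle. */
  void finishBundle() {
    materialized.clear();
  }
}
```

      With a cache like this, reading the same side input for every element of a bundle triggers one deserialization per (view, window) pair instead of one per element. Clearing the cache at the end of the bundle bounds its memory use, at the cost of one extra deserialization when the next bundle starts.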

          People

            Assignee:
            dmvk David Morávek
            Reporter:
            dmvk David Morávek
            Votes:
            0
            Watchers:
            1

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Not Specified
                Remaining:
                0h
                Logged:
                9h 40m