Beam / BEAM-6723

Reshuffle in streaming pipeline does not work on DataflowRunner (Python)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • None
    • 2.16.0
    • None

    Description

      When a Reshuffle is applied after windowing, the DataflowRunner fails with the following error:

      "org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast to org.apache.beam.sdk.transforms.windowing.GlobalWindow"

      This makes it impossible to prevent fusion of operations and to distribute our workload evenly across workers.

      For more context, see the following Stack Overflow post, where robertwb linked to this Jira issue:

      https://stackoverflow.com/questions/54764081/google-dataflow-streaming-pipeline-is-not-distributing-workload-over-several-wor

      Quick summary: Our use case is video analytics. We use windowing to cut the video into small intervals and grouping to group per video stream, so the group-by key is a game ID. The output of windowing and grouping is several windows within one game, all sharing a single game_id key, and this work is not distributed over several workers. The proposed workaround is to add the window ID to the key after windowing and then reshuffle, so that processing can run in parallel per window within one video stream. Can someone confirm that this approach would work, provided Reshuffle itself works?
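
      To illustrate the proposed workaround, here is a minimal plain-Python sketch (not Beam code) of why adding the window to the key increases the number of distinct keys. The 10-second window size, the sample frames, and the helper names fixed_window_start, add_window_to_key and group_by_key are all hypothetical. In an actual pipeline the re-keying step would presumably live in a DoFn that reads the window through beam.DoFn.WindowParam, followed by beam.Reshuffle().

```python
from collections import defaultdict

WINDOW_SIZE_SECONDS = 10  # hypothetical window length, not from the issue


def fixed_window_start(timestamp, size=WINDOW_SIZE_SECONDS):
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def add_window_to_key(game_id, timestamp, size=WINDOW_SIZE_SECONDS):
    """Build the composite key (game_id, window_start) from the workaround."""
    return (game_id, fixed_window_start(timestamp, size))


def group_by_key(keyed_elements):
    """Group (key, value) pairs by key, keeping values in arrival order."""
    groups = defaultdict(list)
    for key, value in keyed_elements:
        groups[key].append(value)
    return dict(groups)


# Hypothetical input: one game, one frame every 5 seconds for 30 seconds.
frames = [("game-1", t, f"frame-{t}") for t in range(0, 30, 5)]

# Keyed by game_id only: every frame lands under the same key.
by_game = group_by_key((game_id, name) for game_id, _, name in frames)

# Keyed by (game_id, window_start): one key per window, so a shuffle on
# this key can hand each window to a different worker.
by_game_and_window = group_by_key(
    (add_window_to_key(game_id, t), name) for game_id, t, name in frames
)

print(len(by_game))             # 1
print(len(by_game_and_window))  # 3
```

      With a single game there is one key before re-keying and one key per window afterwards, which is what gives the runner something to distribute.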

       

            People

              Assignee: Unassigned
              Reporter: Brecht Coghe (bcoghe)
              Votes: 1
              Watchers: 3
