  1. Beam
  2. BEAM-10774

GBK Python streaming load tests are too slow

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved

    Description

      The following GroupByKey (GBK) streaming test cases take too long to run on Dataflow:

       

      1) 2GB of 10-byte records

      2) 2GB of 100-byte records

      4) fanout 4 times with 2GB of 10-byte records in total

      5) fanout 8 times with 2GB of 10-byte records in total

       

      Each of them needs at least 1 hour to execute, which is far too long for a single Jenkins job.

      Job definition: https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Python.groovy

      Test pipeline: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/group_by_key_test.py

      These cases are probably too extreme. The first two involve grouping 20M unique keys, which is a stressful operation. One possible solution is to rework the cases so that they are less demanding.
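
To give a rough sense of scale, here is a back-of-the-envelope sketch of how many records cases 1 and 2 would contain. It assumes "2GB" means 2 * 10**9 bytes and that the stated record size is the full key plus value size; neither is confirmed here, and the authoritative values are in the job definition linked above. The helper names are made up for illustration. The record count is only an upper bound on the number of distinct keys, since the key distribution is configured separately.

```python
# Assumption: "2GB" is 2 * 10**9 bytes and a record's size is key plus value.
TOTAL_BYTES = 2 * 10**9
ONE_HOUR_SECONDS = 3600


def record_count(total_bytes, record_size_bytes):
    """Number of fixed-size records that fit in total_bytes."""
    return total_bytes // record_size_bytes


def records_per_second(records, seconds):
    """Average throughput if `records` are processed in `seconds`."""
    return records / seconds


# Record sizes taken from test cases 1 and 2 (10-byte and 100-byte records).
for size in (10, 100):
    n = record_count(TOTAL_BYTES, size)
    rate = records_per_second(n, ONE_HOUR_SECONDS)
    print(f"{size}-byte records: {n:,} records, "
          f"about {rate:,.0f} records/s over one hour")
```

Because each run takes at least one hour, the printed rates are upper bounds on the average throughput of these runs under the assumptions above.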

      Both the current production Dataflow runner and the new Dataflow Runner V2 were tested.

            People

              Assignee: Unassigned
              Reporter: Kamil Wasilewski (kamilwu)
              Votes: 0
              Watchers: 3