Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-13721

ZeroDivisionError if source bundle smaller than 1mb

Details

    • Bug
    • Status: Open
    • P3
    • Resolution: Unresolved
    • 2.24.0
    • None
    • io-py-gcp
    • None
    • Important

    Description

      Hi,
      I built a (GCP) DataFlow + apache-beam (version 2.24.0) pipeline, using python.
      The pipeline's stages are:
      1. reading from BigQuery
      2. running a custom function (using the ParDo while passing it a class inheriting 'beam.DoFn'), that transforms the read data.
      3. writing the transformed data to BigQuery
       
      The pipeline works fine when stage 1 is querying a small amount of data, but when it is querying data from the last six months (lots of data), I am getting this error:
       

      apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
      Traceback (most recent call last):
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
          work_executor.execute()
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
          op.start()
        File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
        File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
        File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
        File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workercustomsources.py", line 69, in _{_}iter{_}_
          self._source.start_position, self._source.stop_position)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 78, in get_range_tracker
          start_position, stop_position, self._source_bundles)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 131, in _{_}init{_}_
          self._compute_cumulative_weights(source_bundles[start[0]:last]) + [1] *
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 154, in _compute_cumulative_weights
          running_total.append(max(min_diff, min(1, running_total[-1] + w / total)))
      ZeroDivisionError: float division by zero
      

       
      I saw this issue in your repository:
       
      https://issues.apache.org/jira/browse/BEAM-10004
       
      which is referencing the same problem I have, but even though this fix is already implemented in my version of apache-beam (2.24.0), I still get this error.
       
      Can you please guide me in fixing this issue?
      Is it something I am doing wrong or is this your bug?
       
      Thank you in advance and have a good one!
       
      Inbar Dekel

      Attachments

        Activity

          People

            Unassigned Unassigned
            inbar.dekel Inbar Dekel
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: