  1. Beam
  2. BEAM-6064

Python BigQuery performance much worse than Java

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: sdk-py-core
    • Labels: None

    Description

      Reading from BigQuery with the Python SDK seems to be much slower than with the Java SDK.

      To reproduce this, I ran the following two programs on Google Cloud Dataflow. Both read the weights from the public data set "natality" and output the 100 largest weights.

      Python:

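      import apache_beam as beam
      from apache_beam import Pipeline
      from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions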
      
      options = PipelineOptions()
      options.view_as(StandardOptions).runner = 'DataflowRunner'
      # <cut more options>
      
      pipeline = Pipeline(options=options)
      (pipeline
          | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
          | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
          | 'Top' >> beam.combiners.Top.Largest(100)
          | 'MapToString' >> beam.Map(lambda elem: str(elem))
          | 'Write' >> beam.io.WriteToText("<output-file>"))
      
      pipeline.run()
      

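      Java:

      import org.apache.beam.runners.dataflow.DataflowRunner;
      import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.io.TextIO;
      import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;
      import org.apache.beam.sdk.transforms.MapElements;
      import org.apache.beam.sdk.transforms.Top;
      import org.apache.beam.sdk.values.TypeDescriptors;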
      
      public class Natality {
          public static void main(String[] args) {
              DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
              options.setRunner(DataflowRunner.class);
              // <cut more options>
              
              Pipeline pipeline = Pipeline.create(options);
      
              pipeline.apply("Read", BigQueryIO.readTableRows()
                  .fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]"))
                  .apply("MapToDouble", MapElements
                      .into(TypeDescriptors.doubles())
                      .via(row -> {
                           Object obj = row.get("weight_pounds");
                           return (obj == null ? 0.0 : (Double) obj);
                      }))
                  .apply("Top", Top.largest(100))
                  .apply("MapToString", MapElements
                      .into(TypeDescriptors.strings())
                      .via(weight -> weight.toString()))
                  .apply("Write", TextIO.write().to("<output-file>"));
      
              pipeline.run().waitUntilFinish();
          }
      }
      

      The elided "<cut more options>" lines set basic options such as project, job name, and temp location. Both programs produce identical outputs.

      Running each program launches a Dataflow job on Google Cloud, with the following results (data from the Google Cloud Platform web interface; screenshots attached).

      Python:

      Step         Status     Time
      Read         Succeeded  1 hr 40 min 40 sec
      MapToFloat   Succeeded  2 min 43 sec
      Top          Succeeded  5 min 25 sec
      MapToString  Succeeded  0 sec
      Write        Succeeded  3 sec

      Java:

      Step         Status     Time
      Read         Succeeded  4 min 45 sec
      MapToDouble  Succeeded  45 sec
      Top          Succeeded  52 sec
      MapToString  Succeeded  0 sec
      Write        Succeeded  1 sec
      

      As you can see, reading from BigQuery is far slower in Python: 1 hr 40 min versus less than 5 minutes in Java, roughly a 21x difference.

      Furthermore, the other standard operations are also much slower in Python than in Java: Top takes 5 min 25 sec versus 52 sec, and the map step takes 2 min 43 sec versus 45 sec.
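
      As a rough check of the figures above, the per-step slowdown can be computed directly from the reported times. This is only a standalone sketch: the names PYTHON_TIMES, JAVA_TIMES, to_seconds, and slowdown_factors are made up for illustration, "Map" stands for MapToFloat (Python) and MapToDouble (Java), and MapToString is left out because both runs report 0 sec. The timings are rounded to whole seconds, so the Write ratio in particular should not be read as precise.

```python
# Step times copied from the listings in this report, as (minutes, seconds).
PYTHON_TIMES = {
    "Read": (100, 40),  # 1 hr 40 min 40 sec
    "Map": (2, 43),     # MapToFloat
    "Top": (5, 25),
    "Write": (0, 3),
}
JAVA_TIMES = {
    "Read": (4, 45),
    "Map": (0, 45),     # MapToDouble
    "Top": (0, 52),
    "Write": (0, 1),
}


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) pair to total seconds."""
    return minutes * 60 + seconds


def slowdown_factors(python_times, java_times):
    """Return {step: python_seconds / java_seconds} for each step."""
    return {
        step: to_seconds(*python_times[step]) / to_seconds(*java_times[step])
        for step in python_times
    }


if __name__ == "__main__":
    for step, factor in slowdown_factors(PYTHON_TIMES, JAVA_TIMES).items():
        print(f"{step}: Python is about {factor:.1f}x slower")
```

      The Read step dominates both the absolute time and the ratio, which is why the report focuses on BigQuery reads rather than on the transforms.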

       

      Attachments

        1. Screenshot from 2019-02-01 10-10-45.png
          86 kB
          Javier Domingo Cansino
        2. results-python.png
          102 kB
          Jan Kuipers
        3. results-java.png
          98 kB
          Jan Kuipers


          People

            Assignee: Unassigned
            Reporter: Jan Kuipers
            Votes: 1
            Watchers: 7


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 6h