[BEAM-6064] Python BigQuery performance much worse than Java - ASF JIRA

Details

Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Affects Version/s: 2.8.0
Fix Version/s: None
Component/s: sdk-py-core
Labels: None

Description

Reading from BigQuery with the Python SDK seems to be much slower than reading from it with the Java SDK.

To reproduce this, I ran the following two programs on Google Cloud. Each reads the weights from the public data set "natality" and outputs the 100 largest weights.

Python:

# <cut imports>

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DataflowRunner'
# <cut more options>

pipeline = Pipeline(options=options)
(pipeline
    | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
    | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
    | 'Top' >> beam.combiners.Top.Largest(100)
    | 'MapToString' >> beam.Map(lambda elem: str(elem))
    | 'Write' >> beam.io.WriteToText("<output-file>"))

pipeline.run()

Java:

// <cut imports>

public class Natality {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        // <cut more options>
        
        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply("Read", BigQueryIO.readTableRows()
            .fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]"))
            .apply("MapToDouble", MapElements
                .into(TypeDescriptors.doubles())
                .via(row -> {
                     Object obj = row.get("weight_pounds");
                     return (obj == null ? 0.0 : (Double) obj);
                }))
            .apply("Top", Top.largest(100))
            .apply("MapToString", MapElements
                .into(TypeDescriptors.strings())
                .via(weight -> weight.toString()))
            .apply("Write", TextIO.write().to("<output-file>"));

        pipeline.run().waitUntilFinish();
    }
}

The "<cut more options>" placeholders stand for basic options such as project, job name, and temp location. Both programs produce identical outputs.
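
One difference between the two programs is that the Java version maps a null weight_pounds to 0.0, while the Python lambda passes the value through unchanged. If the Python step should mirror that behaviour, a minimal sketch could look like the following. Note that weight_or_zero is a hypothetical helper, not part of Beam, and it assumes rows arrive as dictionaries, as the original lambda already does.

```python
def weight_or_zero(row):
    """Return row['weight_pounds'] as a float, or 0.0 when it is missing or None.

    Hypothetical helper mirroring the null handling of the Java MapToDouble step.
    """
    value = row.get("weight_pounds")
    return 0.0 if value is None else float(value)


if __name__ == "__main__":
    # Plain-Python check, no Beam or BigQuery access needed.
    print(weight_or_zero({"weight_pounds": 7.5}))
    print(weight_or_zero({"weight_pounds": None}))
```

In the pipeline it would replace the lambda in the 'MapToFloat' step, i.e. beam.Map(weight_or_zero). This is not expected to affect the Read timing reported below.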

Running each program launches a Dataflow job on Google Cloud, with the following results (data from the Google Cloud Platform web interface; screenshots attached).

Python:

Read Succeeded 1 hr 40 min 40 sec
MapToFloat Succeeded 2 min 43 sec
Top Succeeded 5 min 25 sec
MapToString Succeeded 0 sec
Write Succeeded 3 sec

Java:

Read Succeeded 4 min 45 sec
MapToDouble Succeeded 45 sec
Top Succeeded 52 sec
MapToString Succeeded 0 sec
Write Succeeded 1 sec

As you can see, there is an enormous performance hit in Python when reading from BigQuery: the Read step takes 1 hr 40 min 40 sec versus 4 min 45 sec in Java, roughly 21 times longer.

Furthermore, the other standard operations are also much slower in Python than in Java: Top takes 5 min 25 sec versus 52 sec, and the map step takes 2 min 43 sec versus 45 sec.
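
To put numbers on the gap, the step timings reported above can be converted to seconds and compared. The following is only a sketch: the helper names to_seconds and slowdown are hypothetical, and the timings are copied from the results listed in this report.

```python
import re

# (step name, Python time, Java time), copied from the results in this report.
STEPS = [
    ("Read", "1 hr 40 min 40 sec", "4 min 45 sec"),
    ("MapToFloat / MapToDouble", "2 min 43 sec", "45 sec"),
    ("Top", "5 min 25 sec", "52 sec"),
    ("MapToString", "0 sec", "0 sec"),
    ("Write", "3 sec", "1 sec"),
]

UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def to_seconds(text):
    """Convert a duration such as '1 hr 40 min 40 sec' to whole seconds."""
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def slowdown(python_time, java_time):
    """Return Python time divided by Java time, or None if Java took 0 sec."""
    java_seconds = to_seconds(java_time)
    if java_seconds == 0:
        return None
    return to_seconds(python_time) / java_seconds


if __name__ == "__main__":
    for name, python_time, java_time in STEPS:
        ratio = slowdown(python_time, java_time)
        label = "n/a" if ratio is None else "{:.1f}x".format(ratio)
        print("{}: {}".format(name, label))
```

The MapToString step is reported as 0 sec in both SDKs, so no ratio is computed for it.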

Attachments

Screenshot from 2019-02-01 10-10-45.png, 01/Feb/19 10:14, 86 kB, Javier Domingo Cansino
results-python.png, 14/Nov/18 14:09, 102 kB, Jan Kuipers
results-java.png, 14/Nov/18 14:09, 98 kB, Jan Kuipers

Issue Links

links to

GitHub Pull Request #12485

GitHub Pull Request #12489

People

Assignee: Unassigned

Reporter: Jan Kuipers

Votes: 1

Watchers: 7

Dates

Created: 14/Nov/18 14:07

Updated: 03/Jun/22 23:03
