[BEAM-7094] DataflowRunner does not scale when reading gzip file - ASF JIRA

Details

Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Affects Version/s: 2.11.0
Fix Version/s: None
Component/s: runner-dataflow, sdk-py-core
Labels: None
Environment: Python on Dataflow

Description

Hi,

I have a pipeline that uses ReadFromText() to read a 700 MB gzip (.gz) file from a GCS bucket.

It then parses the JSON, creates BigQuery rows, and writes them with WriteToBigQuery.

This pipeline does not scale. If I specify 2 workers at startup, Dataflow scales the job down to 1 worker and the throughput remains the same. The job takes 30 minutes.

I found that the exact same pipeline, reading the same data as an uncompressed 11 GB file from the same location, scales very well. That job takes only 5 minutes.
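
For reference, a minimal sketch of the pipeline shape described above. It is an illustration, not the reporter's actual code: the helper `to_bq_row`, the `run` function, the field names, and the two-column schema are all hypothetical, since the report does not include the real schema or paths. `ReadFromText` is left at its default compression handling, which infers gzip from the `.gz` extension.

```python
import json


def to_bq_row(line):
    """Parse one JSON line into a dict shaped like a BigQuery row.

    The "id" and "payload" fields are hypothetical placeholders.
    """
    record = json.loads(line)
    return {
        "id": str(record.get("id")),
        "payload": json.dumps(record, sort_keys=True),
    }


def run(input_path, output_table, pipeline_args=None):
    """Build and run the read -> parse -> write pipeline.

    input_path:   e.g. a gs:// path to the .gz or uncompressed file
    output_table: e.g. "project:dataset.table"
    """
    # Imported here so to_bq_row can be exercised without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            # Compression type defaults to AUTO (detected from the file name).
            | "Read" >> beam.io.ReadFromText(input_path)
            | "ToRow" >> beam.Map(to_bq_row)
            | "Write" >> beam.io.WriteToBigQuery(
                output_table,
                schema="id:STRING,payload:STRING",  # assumed schema
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Running the same `run(...)` call once against the `.gz` path and once against the uncompressed path, with otherwise identical options, should be enough to compare the scaling behaviour reported here.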

People

Assignee: Unassigned

Reporter: github.com/moander

Votes: 0

Watchers: 1

Dates

Created: 17/Apr/19 09:41

Updated: 03/Jun/22 23:51