Details
-
Bug
-
Status: Open
-
P3
-
Resolution: Unresolved
-
2.11.0
-
None
-
None
-
Python on Dataflow
Description
Hi,
I have a pipe that ReadFromText() a 700mb gz file from a GS bucket.
It then parse json, create BigQuery row, and WriteToBigQuery.
The pipeline above does not scale. If I specify 2 workers on startup it will scale it down to 1 and the throughput remains the same. The job takes 30 minutes.
What I found is that the exact same pipeline, reading the same but uncompressed 11gb file from the same location scales very well. The job only takes 5 minutes.