Details
- Type: Bug
- Status: Resolved
- Priority: P3
- Resolution: Won't Fix
Description
When I run a query (even with many splits) against the production datastore (such as in the datastore_wordcount demo), it operates as follows:
1. split the query into a bunch of split queries
2. run each split query, collecting the results
3. then pass the results to the following stage / ParDo
However, with the DirectRunner, step 2 runs to completion before step 3 starts, so a large dataset must be fully downloaded before any of the following stages run.
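The behavior described above can be sketched as follows. This is an illustrative Python sketch, not actual Beam code: `split_query`, `run_query`, `read_eagerly`, and `read_streaming` are hypothetical stand-ins for the Datastore reader internals, contrasting the observed drain-everything-first behavior with a pipelined alternative.

```python
# Hypothetical sketch of the two execution strategies; none of these
# names are real Beam or Datastore APIs.

def split_query(query, num_splits):
    """Stand-in for step 1: pretend each split is just a shard index."""
    return [(query, i) for i in range(num_splits)]

def run_query(split):
    """Stand-in for running one split query; returns fake entities."""
    query, shard = split
    return [f"{query}-{shard}-{n}" for n in range(2)]

def read_eagerly(query, num_splits):
    """Step 2 as observed: every split is drained before step 3 starts."""
    results = []
    for split in split_query(query, num_splits):
        results.extend(run_query(split))  # blocks until this split is done
    return results  # downstream ParDo sees nothing until this returns

def read_streaming(query, num_splits):
    """A pipelined alternative: yield entities as each split produces them."""
    for split in split_query(query, num_splits):
        for entity in run_query(split):
            yield entity  # downstream could start processing immediately
```

Both produce the same elements; the difference is only when downstream stages see the first one.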
Even if this behavior is intentional and local parallelism/pipelining is impossible, there are no output or status messages. Debugging why my code appeared to hang before processing any results took forever: I had to dig through the Beam code and instrument it with debug logging to figure out what was going on.
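Even without pipelining, a per-split progress message would have made the silent read phase diagnosable. A minimal sketch using Python's standard `logging` module (the wrapper and its names are hypothetical, not a proposed Beam patch):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("datastore_read")

def run_splits_with_progress(splits, run_query):
    """Hypothetical wrapper: run each split query and log progress,
    so a long read phase no longer looks like a hang."""
    results = []
    for i, split in enumerate(splits, start=1):
        logger.info("Running split %d of %d", i, len(splits))
        results.extend(run_query(split))
    logger.info("Fetched %d results from %d splits", len(results), len(splits))
    return results
```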
See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for more details.
This happens at GitHub head, 0.7.0-dev (there was no matching "version" tag to select above).