Details
- Type: Bug
- Status: Resolved
- Priority: P3
- Resolution: Won't Fix
Description
When I run a query (even with many splits) against the production datastore (such as in the datastore_wordcount demo), it operates as follows:
1. split the query into a bunch of split queries
2. run each split query, collecting the results
3. then pass the results to the following stage / ParDo
However, with the DirectRunner, step 2 runs to completion before step 3 starts, so a large dataset must be fully downloaded before any of the following stages run.
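The behavior described above can be sketched as follows. This is an illustrative Python sketch, not actual Beam code: `split_query`, `run_query`, `read_eagerly`, and `read_streaming` are hypothetical stand-ins for the Datastore reader internals, contrasting the observed drain-everything-first behavior with a pipelined alternative.

```python
# Hypothetical sketch of the two execution strategies; none of these
# names are real Beam or Datastore APIs.

def split_query(query, num_splits):
    """Stand-in for step 1: pretend each split is just a shard index."""
    return [(query, i) for i in range(num_splits)]

def run_query(split):
    """Stand-in for running one split query; returns fake entities."""
    query, shard = split
    return [f"{query}-{shard}-{n}" for n in range(2)]

def read_eagerly(query, num_splits):
    """Step 2 as observed: every split is drained before step 3 starts."""
    results = []
    for split in split_query(query, num_splits):
        results.extend(run_query(split))  # blocks until this split is done
    return results  # downstream ParDo sees nothing until this returns

def read_streaming(query, num_splits):
    """A pipelined alternative: yield entities as each split produces them."""
    for split in split_query(query, num_splits):
        for entity in run_query(split):
            yield entity  # downstream could start processing immediately
```

Both produce the same elements; the difference is only when downstream stages see the first one.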
Even if this behavior is intentional and local parallelism/pipelining is impossible, there are no output or status messages. Debugging why my code appeared to hang before processing any results took forever: I had to dig through the Beam code and instrument it with debug logging to figure out what was going on.
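Even without pipelining, a per-split progress message would have made the silent read phase diagnosable. A minimal sketch using Python's standard `logging` module (the wrapper and its names are hypothetical, not a proposed Beam patch):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("datastore_read")

def run_splits_with_progress(splits, run_query):
    """Hypothetical wrapper: run each split query and log progress,
    so a long read phase no longer looks like a hang."""
    results = []
    for i, split in enumerate(splits, start=1):
        logger.info("Running split %d of %d", i, len(splits))
        results.extend(run_query(split))
    logger.info("Fetched %d results from %d splits", len(results), len(splits))
    return results
```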
See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for more details.
This happens at GitHub head, 0.7.0-dev (there was no matching "version" tag to select above).