  1. Beam
  2. BEAM-7026

Python SDK: Unable to obtain the PCollection for output tags which are not consumed by a downstream step.



    • Type: New Feature
    • Status: In Progress
    • Priority: P2
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: sdk-py-harness
    • Labels:


      I noticed that we are not able to convert the output tag + transform to the PCollection name for metrics (element count/mean byte count) if the PCollections for the output tags are not consumed by a downstream step.

      This isn't critical because (1) arguably there is no PCollection at all, and (2) PCollections that are output but not consumed are not critical to count metrics on, as they can be optimized away entirely (no need to do any work, collect metrics, etc. for an unconsumed PCollection).

      However, we are able to count these elements but unable to assign a PCollection name to the count, because in this case the bundle descriptor contains no information about that output tag. The alternative fix is to make sure that it's always available, even if not consumed.

      Pablo and I looked into this a bit, and he believed it would be possible in pvalue.py's DoOutputsTuple class. This fix would require calling __getitem__ on all tags to initialize them properly. However, I had some trouble doing this, as this class is a bit strange since it overrides __getattr__. I found weird behaviors when adding functionality to this code. I don't really get how the code functions today, as its own instance variable usage should trigger the custom __getattr__ code, yet we seem to be using these attrs normally with self.X usages.
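      On the self.X question: in Python, __getattr__ is only a fallback. It is invoked when normal attribute lookup (instance __dict__, then the class) fails, so attributes assigned in __init__ are read without ever reaching it. The sketch below illustrates that behavior, plus one possible way to initialize every tag eagerly by calling __getitem__ on each. LazyOutputs, materialize_all, and the 'pcoll:<tag>' strings are hypothetical stand-ins for illustration only; this is not the actual DoOutputsTuple code.

```python
class LazyOutputs(object):
    """Toy stand-in for a tagged-output container (hypothetical, not Beam code)."""

    def __init__(self, tags):
        # Ordinary assignments store values in the instance __dict__, so later
        # reads of self._tags and self._created never reach __getattr__.
        self._tags = tuple(tags)
        self._created = {}

    def __getattr__(self, name):
        # Fallback only: Python calls this when normal lookup has failed.
        # Refusing underscore names avoids infinite recursion if _tags or
        # _created is missing (for example, before __init__ has run).
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, tag):
        if tag not in self._tags:
            raise KeyError(tag)
        # Lazily create the output the first time the tag is requested.
        if tag not in self._created:
            self._created[tag] = 'pcoll:%s' % tag
        return self._created[tag]

    def materialize_all(self):
        # Assumed fix: touch every tag so each output exists even if no
        # downstream step ever asks for it.
        for tag in self._tags:
            self[tag]
        return dict(self._created)


outputs = LazyOutputs(['main', 'errors'])
print(outputs._created)           # nothing has been created yet
print(outputs.errors)             # resolved via __getattr__ -> __getitem__
print(outputs.materialize_all())  # every declared tag is now initialized
```

      If the real class follows a similar pattern, the eager initialization would need to run after the instance attributes are set, and any new attribute whose name collides with a tag would shadow that tag on attribute access.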




            • Assignee:
              ajamato@google.com Alex Amato


              • Created: