Beam / BEAM-6243

TFX pipelines experience a huge blowup in intermediate data size

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: runner-flink
    • Labels: None

    Description

      The elements in TFX intermediate collections are dictionaries of (typically single-element) numpy arrays. These are relatively expensive to serialize: for example, pickle is used for the numpy wrapper around a primitive int/float, and the column names are repeated in every element.
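
      As a rough illustration of the per-element overhead described above, the following sketch pickles one row shaped like a TFX intermediate element and compares it with the raw values alone. The column names, values, and the `struct` baseline are assumptions made up for this example; they are not taken from TFX or Beam, and the exact byte counts depend on the numpy and pickle versions in use.

      ```python
      import pickle
      import struct

      import numpy as np

      # Hypothetical row in the shape the description mentions: a dict mapping
      # column names to single-element numpy arrays.
      row = {
          "age": np.array([42], dtype=np.int64),
          "income": np.array([51234.5], dtype=np.float64),
          "clicks": np.array([7], dtype=np.int64),
      }

      # Per-element cost when each row is pickled on its own: the numpy array
      # wrapper and the column names are written out again for every row.
      pickled_size = len(pickle.dumps(row, protocol=pickle.HIGHEST_PROTOCOL))

      # Baseline for comparison: only the raw values, with the schema (names and
      # types) assumed to be stored once elsewhere. "<qdq" = int64, float64, int64.
      packed = struct.pack(
          "<qdq",
          int(row["age"][0]),
          float(row["income"][0]),
          int(row["clicks"][0]),
      )
      packed_size = len(packed)

      print(f"pickled dict of arrays: {pickled_size} bytes")
      print(f"raw values only:        {packed_size} bytes")
      ```

      The gap between the two numbers is the overhead paid on every element each time the collection is serialized.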

      A better intermediate representation would help, but the problem is exacerbated because the fusion algorithm does not pack as much as possible into executable stages.
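
      To make the interaction with fusion concrete, here is a toy model, not Beam's actual fusion algorithm. It assumes that transforms fused into one stage pass elements in memory, that every hand-off between two stages serializes each element once, and that element size stays constant along the chain. The function name, stage layouts, and transform names are hypothetical, and plain lists stand in for numpy arrays to keep the sketch dependency-free.

      ```python
      import pickle


      def bytes_crossing_boundaries(stages, elements):
          """Total pickled bytes written at the boundaries between consecutive stages.

          `stages` is a list of stages, each a list of transform names.
          Only the hand-off between stages is assumed to serialize elements.
          """
          boundaries = max(len(stages) - 1, 0)
          per_pass = sum(len(pickle.dumps(e)) for e in elements)
          return boundaries * per_pass


      # Hypothetical rows: a dict of single-element columns per element.
      elements = [{"age": [i], "income": [float(i)]} for i in range(1000)]

      # The same four transforms, fused into one stage or left as four stages.
      well_fused = [["parse", "clean", "scale", "write"]]
      poorly_fused = [["parse"], ["clean"], ["scale"], ["write"]]

      print("well fused:  ", bytes_crossing_boundaries(well_fused, elements))
      print("poorly fused:", bytes_crossing_boundaries(poorly_fused, elements))
      ```

      Under these assumptions the serialized volume grows linearly with the number of stage boundaries, so an expensive per-element encoding is multiplied by every boundary that fusion fails to eliminate.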

            People

              Assignee: Unassigned
              Reporter: Robert Bradshaw (robertwb)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 20m