Beam / BEAM-9620

textio (and fileio in general) takes too long to estimate sizes of large globs

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: sdk-py-core
    • Labels: None

    Description

      As a workaround, we could introduce a way to skip size estimation when reading large globs. For example, the Java SDK has the withHintMatchesManyFiles() option:

       

      https://github.com/apache/beam/blob/850e8469de798d45ec535fe90cb2dc5dbda4974a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L371
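
A minimal sketch of what such an opt-out could look like on the Python side. Everything in it is hypothetical: `GlobSource`, its `skip_size_estimation` flag, and the `stat_fn` callback are illustrative names and not part of the Beam API. The only point is that `estimate_size()` reports "unknown" (`None`) instead of stat-ing every matched file.

```python
from typing import Callable, Iterable, Optional


class GlobSource:
    """Hypothetical stand-in for a file-based source; NOT a Beam class."""

    def __init__(
        self,
        files: Iterable[str],
        stat_fn: Callable[[str], int],
        skip_size_estimation: bool = False,
    ) -> None:
        self._files = list(files)
        self._stat_fn = stat_fn  # returns the size in bytes of one file
        self._skip = skip_size_estimation

    def estimate_size(self) -> Optional[int]:
        # With the hint set, report "size unknown" without touching storage.
        if self._skip:
            return None
        # Otherwise pay one stat call per matched file.
        return sum(self._stat_fn(path) for path in self._files)


# Record every stat call so the cost of estimation is visible.
stat_calls = []


def fake_stat(path: str) -> int:
    stat_calls.append(path)
    return 10  # pretend every file is 10 bytes


files = ["gs://bucket/part-%05d" % i for i in range(1000)]

eager = GlobSource(files, fake_stat)
hinted = GlobSource(files, fake_stat, skip_size_estimation=True)

eager_size = eager.estimate_size()    # stats all 1000 files
hinted_size = hinted.estimate_size()  # performs no stat calls
```

In this sketch the cost of estimation grows linearly with the number of matched files, which is why an explicit hint is attractive for very large globs.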

       

      Additionally, it seems that size estimation is repeated when the same PCollection, read from a file-based source, is applied to multiple PTransforms.

       

      See the following for more details:

      https://stackoverflow.com/questions/60874942/avoid-recomputing-size-of-all-cloud-storage-files-in-gcsio-beam-python-sdk
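
One possible way to avoid the repeated estimation is to memoize the result so that every consumer of the same source reuses it. The sketch below is an assumption about how that could be structured, not a description of how Beam works today: `CachedSizeEstimate` and `expensive_estimate` are hypothetical names.

```python
import threading


class CachedSizeEstimate:
    """Hypothetical wrapper that computes a size estimate at most once."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._lock = threading.Lock()
        self._computed = False
        self._value = None

    def get(self):
        # The lock keeps concurrent callers from each triggering a full scan.
        with self._lock:
            if not self._computed:
                self._value = self._compute_fn()
                self._computed = True
            return self._value


# Count how many times the expensive estimation actually runs.
calls = {"count": 0}


def expensive_estimate():
    calls["count"] += 1
    return 123456  # placeholder for the total size of all matched files


estimate = CachedSizeEstimate(expensive_estimate)

# Three downstream consumers ask for the size of the same source.
sizes = [estimate.get() for _ in range(3)]
```

Here the underlying estimation runs once, however many transforms consume the source. The trade-off is that a cached value can go stale if files matching the glob change after the first call.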

          People

            Assignee: Unassigned
            Reporter: Chamikara Madhusanka Jayalath (chamikara)
            Votes: 0
            Watchers: 5