  Beam / BEAM-7753

All file-based IOs should provide the flexibility to plug in custom logic to create the output element from data and file metadata

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io-java-files
    • Labels: None

    Description

      Currently, the File IO classes seem to be structured so that each format-specific IO (e.g. TextIO, XmlIO, etc.) provides a SourceFunction that knows how to split a file of that format and how to read its records.

      However, the developer/end-user has no choice in terms of how the output element is constructed or what its type would be.

      For example, each format-specific IO typically converts PCollection<ReadableFile> --> PCollection<T>, where T varies by file format (e.g. T = String for TextIO, while T = a POJO generated from an XSD for XmlIO, and so on).

      At the moment, the end-user can add a ParDo of <T> --> <OUT> i.e. convert the PCollection<T> --> PCollection<OUT>

      However, OUT in the above case can only be constructed from file data and the user has no easy way to get access to the file metadata from which the record T originated.

      For example, the OUT record might need to contain metadata of the file location from which the record originated.

      i.e. we want f(T, ReadableFile) -> OUT instead of f(T) -> OUT

      To do this, every file-based IO should give the user the flexibility to plug in a function that controls how OUT is created from data + metadata (T + ReadableFile + other metadata where applicable).

      I would be happy to take up and implement this task if folks feel that this is a worthy goal to achieve in the File based IOs.

      Possible solutions:

      1. The simpler (but less flexible) solution would be to change ReadAllViaFileBasedSource.ReadFileRangesFn from DoFn<KV<ReadableFile, OffsetRange>, T> to DoFn<KV<ReadableFile, OffsetRange>, KV<ReadableFile, T>>
      or, by extension, to change ReadAllViaFileBasedSource from PTransform<PCollection<ReadableFile>, PCollection<T>> to PTransform<PCollection<ReadableFile>, PCollection<KV<ReadableFile, T>>>

      However, this approach is restrictive in the sense that we assume that the only metadata the user is interested in is the metadata available within ReadableFile.
      If the user needs access to other metadata, such as offset ranges or other format-specific metadata, then this design won't allow for that.
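      As a rough illustration of option 1, the sketch below pairs every record with the file it came from. It is plain Java rather than Beam code: FileRef is a hypothetical stand-in for ReadableFile, Map.Entry plays the role of KV, and readWithFile is an invented helper that stands in for the format-specific read step.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Beam's ReadableFile; it only carries a path here.
final class FileRef {
    final String path;

    FileRef(String path) {
        this.path = path;
    }
}

final class Option1Sketch {
    // Stand-in for a format-specific reader. Each record is emitted together
    // with its source file, mirroring the proposed KV<ReadableFile, T> output.
    static List<Map.Entry<FileRef, String>> readWithFile(FileRef file, List<String> records) {
        List<Map.Entry<FileRef, String>> out = new ArrayList<>();
        for (String record : records) {
            out.add(new SimpleImmutableEntry<>(file, record));
        }
        return out;
    }

    public static void main(String[] args) {
        FileRef file = new FileRef("input.txt"); // illustrative path only
        for (Map.Entry<FileRef, String> kv : readWithFile(file, Arrays.asList("a", "b"))) {
            // A downstream step can now see the file, but nothing else:
            // the offset range, for example, is not part of the pair.
            System.out.println(kv.getKey().path + " -> " + kv.getValue());
        }
    }
}
```

      The output type is fixed by the reader, so any metadata beyond the file itself would need yet another change to the signature.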

      2. The more flexible solution is to allow the user to configure a function, say EncodeFn<T, OUT> with a signature that looks like OUT encode(ReadableFile, T). That way the user has full control over the type of OUT and the user also has access to metadata (ReadableFile) and can thus build OUT from data + metadata (T + ReadableFile)

      The first option then simply becomes a special case of this, where we use EncodeFn<T, KV<ReadableFile, T>> (i.e. OUT = KV<ReadableFile, T>)
      Also, it is easy to maintain backward compatibility with existing readAll() features of all File Based IOs since they essentially evaluate to a special case where we use EncodeFn<T, T> (OUT = T)
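      A minimal sketch of how option 2 and its two special cases might fit together is shown below. It is plain Java, not Beam code: FileStub is a hypothetical stand-in for ReadableFile, Map.Entry plays the role of KV, and the identity(), withFile() and readAll() helpers are names invented for this illustration. Only the EncodeFn<T, OUT> name and the encode(ReadableFile, T) shape come from the proposal above.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Beam's ReadableFile; it only carries a path here.
final class FileStub {
    final String path;

    FileStub(String path) {
        this.path = path;
    }
}

// The user-supplied function from the proposal: builds OUT from file + record.
interface EncodeFn<T, OUT> {
    OUT encode(FileStub file, T record);

    // OUT = T: ignores the file, which matches the existing readAll() output.
    static <T> EncodeFn<T, T> identity() {
        return (file, record) -> record;
    }

    // OUT = KV<ReadableFile, T>: reproduces option 1 as a special case.
    static <T> EncodeFn<T, Map.Entry<FileStub, T>> withFile() {
        return (file, record) -> new SimpleImmutableEntry<>(file, record);
    }
}

final class Option2Sketch {
    // Stand-in for the read step: every record passes through the EncodeFn.
    static <T, OUT> List<OUT> readAll(FileStub file, List<T> records, EncodeFn<T, OUT> fn) {
        List<OUT> out = new ArrayList<>();
        for (T record : records) {
            out.add(fn.encode(file, record));
        }
        return out;
    }

    public static void main(String[] args) {
        FileStub file = new FileStub("input.txt"); // illustrative path only
        List<String> records = Arrays.asList("a", "b");

        // Custom OUT: a String that embeds the originating file path.
        List<String> tagged = readAll(file, records, (f, r) -> f.path + ":" + r);
        System.out.println(tagged);

        // Backward-compatible case: OUT = T.
        System.out.println(readAll(file, records, EncodeFn.<String>identity()));
    }
}
```

      Additional metadata such as offset ranges could be passed as further parameters to encode; that extension is not shown here.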

      This change would need to be made in a homogeneous way across all the existing file-based IO classes.

          People

            Assignee: Unassigned
            Reporter: Soumabrata Chakraborty (soumabrata)
            Votes: 0
            Watchers: 4
