Apache Arrow / ARROW-8163 [C++][Dataset] Allow FileSystemDataset's file list to be lazy / ARROW-17318

[C++][Dataset] Support async streaming interface for getting fragments in Dataset


Details

    Description

      Add `GetFragmentsAsync()` and `GetFragmentsAsyncImpl()` functions to the generic `Dataset` interface so that fragments can be produced in a streaming fashion.

      This is one of the prerequisites for `FileSystemDataset` to support lazy fragment processing, which in turn allows scan operations to start without waiting for the entire dataset to be discovered.

      To ease the transition of the `Dataset`/`AsyncScanner` code to an async implementation, a default implementation of `GetFragmentsAsyncImpl()` should be provided. It would yield a VectorGenerator over the fragments vector, which every implementation of the `Dataset` interface currently stores.
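
      A minimal sketch of how such a default implementation could be shaped, using only the standard library rather than Arrow itself. `Fragment`, `ReadyFuture`, `FragmentGenerator`, `MakeVectorGenerator`, `InMemoryDataset` and `CollectPaths` are simplified, hypothetical stand-ins defined here for illustration; they are not the signatures used in the Arrow code base. The assumptions are that a generator is a callable returning a future for the next fragment, that a null fragment marks the end of the stream, and that every future is already completed because the fragments are held in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dataset fragment.
struct Fragment {
  std::string path;
};

using FragmentPtr = std::shared_ptr<Fragment>;
using FragmentVector = std::vector<FragmentPtr>;

// Hypothetical stand-in for a future that is already completed.
// A real async interface would also allow futures that finish later.
template <typename T>
struct ReadyFuture {
  T value;
  const T& get() const { return value; }
};

// Each call returns a future for the next fragment.
// Assumption: a null fragment marks the end of the stream.
using FragmentGenerator = std::function<ReadyFuture<FragmentPtr>()>;

// Generator over an in-memory vector of fragments.
FragmentGenerator MakeVectorGenerator(FragmentVector fragments) {
  struct State {
    FragmentVector fragments;
    std::size_t next = 0;
  };
  auto state = std::make_shared<State>();
  state->fragments = std::move(fragments);
  return [state]() {
    if (state->next < state->fragments.size()) {
      return ReadyFuture<FragmentPtr>{state->fragments[state->next++]};
    }
    return ReadyFuture<FragmentPtr>{nullptr};  // end of stream
  };
}

class Dataset {
 public:
  virtual ~Dataset() = default;

  // Public entry point; delegates to the overridable implementation.
  FragmentGenerator GetFragmentsAsync() { return GetFragmentsAsyncImpl(); }

 protected:
  explicit Dataset(FragmentVector fragments) : fragments_(std::move(fragments)) {
    // nullptr is reserved as the end-of-stream marker.
    for (const auto& f : fragments_) {
      assert(f != nullptr);
      (void)f;
    }
  }

  // Default: stream the stored vector. A lazy dataset could override this
  // to discover fragments on demand instead.
  virtual FragmentGenerator GetFragmentsAsyncImpl() {
    return MakeVectorGenerator(fragments_);
  }

  FragmentVector fragments_;
};

// Hypothetical concrete dataset that relies on the default implementation.
class InMemoryDataset : public Dataset {
 public:
  explicit InMemoryDataset(FragmentVector fragments)
      : Dataset(std::move(fragments)) {}
};

// Drains a generator and returns the fragment paths in the order received.
std::vector<std::string> CollectPaths(FragmentGenerator gen) {
  std::vector<std::string> paths;
  while (FragmentPtr fragment = gen().get()) {
    paths.push_back(fragment->path);
  }
  return paths;
}
```

      With this shape, a consumer such as `CollectPaths(dataset.GetFragmentsAsync())` pulls one fragment at a time, so an override that produces fragments lazily could be swapped in without changing the calling code.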


            People

              psolodovnikov Pavel Solodovnikov
              psolodovnikov Pavel Solodovnikov
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 40m