  1. Apache Arrow
  2. ARROW-15943

[C++] Filter which files are read in as part of a filesystem dataset, using a string

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      A user reports (see this Stack Overflow post [1]) that they used the basename_template parameter to write files to a dataset, giving some files the prefix "summary" and others the prefix "prediction". The data is saved in partitioned directories. When reading the data back in, they want to choose which subset (predictions vs. summaries) to read, in addition to filtering on the partition variables in their dataset.

      This isn't currently possible: if they open a dataset from an explicit list of files, the data cannot be read in as partitioned data.

      A short-term solution is to suggest they change the structure of how their data is stored, but it could be useful to be able to pass in some sort of filter to determine which files get read in as a dataset.
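
      To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is written in plain Python using only the standard library and does not call any Arrow API. It assumes hive-style `key=value` partition directories and a `.parquet` file extension, neither of which is stated in the report. The names `select_files` and `make_demo_tree` are hypothetical.

```python
import tempfile
from pathlib import Path


def select_files(base_dir, prefix):
    """Return (path, partition_values) for files whose basename starts with prefix.

    Assumes hive-style "key=value" partition directories (an assumption,
    not something the report confirms).
    """
    base = Path(base_dir)
    selected = []
    for path in sorted(base.rglob("*.parquet")):
        # The string filter: keep only files with the requested basename prefix.
        if not path.name.startswith(prefix):
            continue
        # Recover partition values from the directories between base and file,
        # so the partition information is kept even though files are filtered.
        partitions = {}
        for part in path.relative_to(base).parent.parts:
            key, sep, value = part.partition("=")
            if sep:
                partitions[key] = value
        selected.append((path, partitions))
    return selected


def make_demo_tree(base_dir):
    """Create empty placeholder files (not real Parquet) in a partitioned layout."""
    for year in ("2021", "2022"):
        directory = Path(base_dir) / f"year={year}"
        directory.mkdir(parents=True)
        for prefix in ("summary", "prediction"):
            (directory / f"{prefix}-0.parquet").touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_tree(tmp)
        for path, partitions in select_files(tmp, "summary"):
            print(path.relative_to(tmp), partitions)
```

      The point of the sketch is that the file filter and the partition discovery are independent steps: filtering on the basename does not have to discard the partition values encoded in the directory path.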

       

      [1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r

          People

            Assignee: Unassigned
            Reporter: Nicola Crane (thisisnic)
            Votes: 1
            Watchers: 5