ARROW-9476: [C++][Dataset] HivePartitioning discovery with dictionary types fails for multiple fields


    Description

      Apparently, ARROW-9288 did not fully or correctly fix the issue. With a single string partition field, discovery now works fine, but with multiple string partition fields you get parsing errors (see the single-field sketch at the end of this description).

      A reproducible example:

      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds

      foo_keys = np.array(['a', 'b', 'c'], dtype=object)
      bar_keys = np.array(['d', 'e', 'f'], dtype=object)
      N = 30

      table = pa.table({
          'foo': foo_keys.repeat(10),
          'bar': np.tile(np.tile(bar_keys, 5), 2),
          'values': np.random.randn(N)
      })

      base_path = "test_partition_directories3"
      # write a dataset hive-partitioned on two string columns
      pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])

      # works: partition fields inferred as plain strings
      ds.dataset(base_path, partitioning="hive")
      # fails: partition fields requested as dictionary-encoded
      part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
      ds.dataset(base_path, partitioning=part)

      cc bkietz
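
      For comparison, the single-field case works as expected; a minimal sketch reusing the table from above (the directory name here is just illustrative):

      # single string partition field: dictionary discovery succeeds
      base_path_single = "test_partition_directories_single"
      pq.write_to_dataset(table, base_path_single, partition_cols=["foo"])

      part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
      ds.dataset(base_path_single, partitioning=part)  # no parsing error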


People

  Assignee: Ben Kietzman (bkietz)
  Reporter: Joris Van den Bossche (jorisvandenbossche)


Time Tracking

  Original Estimate: Not Specified
  Remaining Estimate: 0h
  Time Spent: 1h 10m