Apache Arrow / ARROW-18269

[C++] Slash character in partition value handling


Details

    Description

       

      The example below shows that pyarrow does not correctly handle a partition value that contains '/':

      import pandas as pd
      import pyarrow as pa
      
      from pyarrow import dataset as ds
      
      df = pd.DataFrame({
          'value': [1, 2],
          'instrument_id': ['A/Z', 'B'],
      })
      
      ds.write_dataset(
          data=pa.Table.from_pandas(df),
          base_dir='data',
          format='parquet',
          partitioning=['instrument_id'],
          partitioning_flavor='hive',
      )
      
      table = ds.dataset(
          source='data',
          format='parquet',
          partitioning='hive',
      ).to_table()
      
      tables = [table]
      
      df = pa.concat_tables(tables).to_pandas()
      
      print(df.head())

      Result:

         value instrument_id
      0      1             A
      1      2             B 

      Expected behaviour:
      Option 1: Result should be:

         value instrument_id
      0      1             A/Z
      1      2             B 

      Option 2: An error should be raised when a partition value contains '/'.
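
      Both options can be illustrated outside of pyarrow with the standard library. The sketch below is not how Arrow implements partitioning; the helper names `encode_partition_value`, `decode_partition_value`, and `check_partition_value` are hypothetical and exist only for this illustration. For Option 1 it assumes the value is percent-encoded before it becomes a directory name and decoded again on read; for Option 2 it simply rejects values that contain '/'.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    """Option 1 (write side): make the value safe to use as one path segment."""
    # quote() leaves '/' unescaped by default; safe="" forces it to be escaped too.
    return quote(value, safe="")


def decode_partition_value(segment: str) -> str:
    """Option 1 (read side): recover the original value from the path segment."""
    return unquote(segment)


def check_partition_value(value: str) -> str:
    """Option 2: refuse values that would create an extra directory level."""
    if "/" in value:
        raise ValueError(f"partition value {value!r} contains '/'")
    return value


# Hive-style directory name for the problematic value from the report.
segment = f"instrument_id={encode_partition_value('A/Z')}"
print(segment)  # instrument_id=A%2FZ

# Reading the value back from the directory name.
_, _, raw = segment.partition("=")
print(decode_partition_value(raw))  # A/Z
```

      With encoding, 'A/Z' stays a single directory level and round-trips unchanged, while a value such as 'B' is written exactly as before.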

       

       

       

            People

              vibhatha Vibhatha Lakmal Abeykoon
              dytyniak Vadym Dytyniak
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h