[ARROW-18269] [C++] Slash character in partition value handling - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.0
Fix Version/s: 11.0.0
Component/s: C++, Python
Labels:
- good-first-issue
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/33448

Description

The following example shows that pyarrow does not correctly handle a partition value that contains '/':

import pandas as pd
import pyarrow as pa

from pyarrow import dataset as ds

df = pd.DataFrame({
    'value': [1, 2],
    'instrument_id': ['A/Z', 'B'],
})

ds.write_dataset(
    data=pa.Table.from_pandas(df),
    base_dir='data',
    format='parquet',
    partitioning=['instrument_id'],
    partitioning_flavor='hive',
)

table = ds.dataset(
    source='data',
    format='parquet',
    partitioning='hive',
).to_table()

tables = [table]

df = pa.concat_tables(tables).to_pandas()

print(df.head())

Result:

   value instrument_id
0      1             A
1      2             B
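
One plausible reading of this result, offered as an assumption rather than a confirmed diagnosis: the writer puts the raw value into the directory name, so 'A/Z' produces a nested directory, and the reader then splits the path on '/' and keeps only 'key=value' segments. The sketch below imitates that behaviour with plain string handling. The helper `parse_hive_path` and the file name `part-0.parquet` are hypothetical and are not part of pyarrow.

```python
def parse_hive_path(path):
    """Collect key=value pairs from a hive-style path (hypothetical helper)."""
    values = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue  # segments without '=' carry no partition information
        key, _, value = segment.partition("=")
        values[key] = value
    return values


# Assumed on-disk layout when the raw value 'A/Z' is used as a directory name.
raw_path = "data/instrument_id=A/Z/part-0.parquet"

print(parse_hive_path(raw_path))
```

Under these assumptions the 'Z' part ends up in its own directory level and is dropped when the path is parsed, which matches the 'A' seen in the output above.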

Expected behaviour:
Option 1: Result should be:

   value instrument_id
0      1             A/Z
1      2             B

Option 2: An error should be raised when a partition value contains '/'.
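
Both options can be illustrated with a minimal sketch that uses only the Python standard library. It is not the change that was merged for this issue. It only shows what percent-encoding the value (Option 1) or rejecting it (Option 2) could look like. The helpers `encode_segment`, `decode_segment` and `check_partition_value` are hypothetical names.

```python
from urllib.parse import quote, unquote


def encode_segment(key, value):
    """Option 1: percent-encode the value so '/' cannot split the path."""
    return f"{key}={quote(value, safe='')}"


def decode_segment(segment):
    """Reverse of encode_segment: split on the first '=' and decode the value."""
    key, _, value = segment.partition("=")
    return key, unquote(value)


def check_partition_value(value):
    """Option 2: refuse values that would create extra directory levels."""
    if "/" in value:
        raise ValueError(f"partition value {value!r} contains '/'")
    return value


segment = encode_segment("instrument_id", "A/Z")
print(segment)
print(decode_segment(segment))

try:
    check_partition_value("A/Z")
except ValueError as exc:
    print(exc)
```

With Option 1 the value survives a round trip through the directory name. With Option 2 the write fails early instead of silently changing the data.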

Issue Links

links to

GitHub Pull Request #14646

People

Assignee: Vibhatha Lakmal Abeykoon

Reporter: Vadym Dytyniak

Votes: 0

Watchers: 5

Dates

Created: 07/Nov/22 15:11

Updated: 11/Jan/23 11:59

Resolved: 05/Dec/22 14:36

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3.5h