ARROW-9476: [C++][Dataset] HivePartitioning discovery with dictionary types fails for multiple fields


    Description

      Apparently, ARROW-9288 did not fully or correctly fix the issue. With a single string partition field, discovery now works fine, but with multiple string partition fields you get parsing errors (see the single-field sketch at the end of this description).

      A reproducible example:

      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds

      foo_keys = np.array(['a', 'b', 'c'], dtype=object)
      bar_keys = np.array(['d', 'e', 'f'], dtype=object)
      N = 30

      table = pa.table({
          'foo': foo_keys.repeat(10),
          'bar': np.tile(np.tile(bar_keys, 5), 2),
          'values': np.random.randn(N)
      })

      base_path = "test_partition_directories3"
      # write a dataset hive-partitioned on two string columns
      pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])

      # works: partition fields inferred as plain strings
      ds.dataset(base_path, partitioning="hive")
      # fails: partition fields requested as dictionary-encoded
      part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
      ds.dataset(base_path, partitioning=part)

      cc bkietz
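
      For comparison, the single-field case works as expected; a minimal sketch reusing the table from above (the directory name here is just illustrative):

      # single string partition field: dictionary discovery succeeds
      base_path_single = "test_partition_directories_single"
      pq.write_to_dataset(table, base_path_single, partition_cols=["foo"])

      part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
      ds.dataset(base_path_single, partitioning=part)  # no parsing error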


People

  Assignee: Ben Kietzman (bkietz)
  Reporter: Joris Van den Bossche (jorisvandenbossche)


Time Tracking

  Original Estimate: Not Specified
  Remaining Estimate: 0h
  Time Spent: 1h 10m