Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 0.15.1
- Fix Version/s: None
Description
When a user saves a dataframe:
df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')
it will create sub-directories named "col_a=val1", "col_a=val2" in /tmp/table. Each of them will contain one or more parquet files with random filenames.
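For illustration, a minimal sketch (the dataframe contents are made up) showing the layout a single write produces:

import os
import pandas as pd

df1 = pd.DataFrame({'col_a': ['x', 'x', 'y'], 'col_b': [1, 2, 3]})
df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')

# The partition directories are named after the column, and each
# holds a parquet file with a random name, e.g.:
#   /tmp/table/col_a=x/<random>.parquet
#   /tmp/table/col_a=y/<random>.parquet
for root, _, files in os.walk('/tmp/table'):
    for name in files:
        print(os.path.join(root, name))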
If the user runs the same command again, the code reuses the existing sub-directories but writes new files with different random names. As a result, any data later loaded from this folder is wrong: every row is present twice.
For example:
df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # second time
df2 = pd.read_parquet('/tmp/table', engine='pyarrow')
assert len(df1) == len(df2)  # raises an error
This is a subtle change in the data that can pass unnoticed.
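A self-contained version of the repro, assuming a small hypothetical dataframe (the values are arbitrary):

import pandas as pd

df1 = pd.DataFrame({'col_a': ['x', 'y'], 'col_b': [1, 2]})
df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # first write
df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # second write reuses the dirs
df2 = pd.read_parquet('/tmp/table', engine='pyarrow')
print(len(df1), len(df2))  # prints "2 4": every row comes back twice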
I would expect the code to prevent the user from using a non-empty destination as a partitioned target. An overwrite flag could also be useful.
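Until the library enforces this, a user-side guard is easy to add. A sketch (the wrapper name safe_to_parquet is made up, and overwrite is emulated by simply deleting the target first):

import os
import shutil

def safe_to_parquet(df, path, overwrite=False, **kwargs):
    # Hypothetical wrapper: refuse to write into a non-empty directory
    # unless the caller explicitly opts in to replacing it.
    if os.path.isdir(path) and os.listdir(path):
        if not overwrite:
            raise FileExistsError(f'{path} is not empty; pass overwrite=True to replace it')
        shutil.rmtree(path)
    df.to_parquet(path, **kwargs)

# safe_to_parquet(df1, '/tmp/table', overwrite=True,
#                 partition_cols=['col_a'], engine='pyarrow')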
Issue Links
- is related to: ARROW-12358 [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset (Open)