Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Currently, when writing a dataset, e.g. from a table consisting of a set of record batches, there is no guarantee that the row order is preserved when reading the dataset.
Small code example:
In [1]: import pyarrow as pa
   ...: import pyarrow.dataset as ds

In [2]: table = pa.table({"a": range(10)})

In [3]: table.to_pandas()
Out[3]:
   a
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

In [4]: batches = table.to_batches(max_chunksize=2)

In [5]: ds.write_dataset(batches, "test_dataset_order", format="parquet")

In [6]: ds.dataset("test_dataset_order").to_table().to_pandas()
Out[6]:
   a
0  4
1  5
2  8
3  9
4  6
5  7
6  2
7  3
8  0
9  1
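As a side note, the reordering above appears tied to threaded writes (see the linked ARROW-16506). A minimal sketch of that partial mitigation, with illustrative paths; that use_threads=False avoids the reordering is an observation from that report, not a documented ordering guarantee:

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": range(10)})
batches = table.to_batches(max_chunksize=2)

# Serialize the write; the reordering reported here and in ARROW-16506
# shows up with use_threads=True (the default).
ds.write_dataset(batches, "test_dataset_order_st", format="parquet",
                 use_threads=False)

# Also scan single-threaded so the read side doesn't reorder batches either.
print(ds.dataset("test_dataset_order_st").to_table(use_threads=False)
        .to_pandas())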
Although unordered results might seem normal in the SQL world, typical dataframe users (R, pandas/dask, etc.) expect the row order to be preserved.
Some applications might also rely on this; e.g. with dask you can have a sorted index column ("divisions" between the partitions) that would get lost this way. (Note that the dask parquet writer itself doesn't use pyarrow.dataset.write_dataset, so it isn't impacted by this.)
Some discussion about this started in https://github.com/apache/arrow/pull/8305 (ARROW-9782), which changed the writer to write all fragments to a single file instead of one file per fragment.
I am not fully sure what the best way to solve this is, but IMO at least having the option to preserve the order would be good.
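Until something like that is built in, a user-side workaround that doesn't depend on write behavior at all is to attach an explicit row-index column before writing and sort on it after reading. A minimal sketch; the column name "__index" is an arbitrary illustrative choice, and Table.sort_by assumes pyarrow 7.0+:

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": range(10)})

# Attach an explicit, monotonically increasing row index before writing.
# "__index" is an arbitrary name, not a pyarrow convention.
indexed = table.append_column("__index", pa.array(range(len(table))))

ds.write_dataset(indexed.to_batches(max_chunksize=2),
                 "test_dataset_indexed", format="parquet")

# Restore the original row order on read, then drop the helper column.
restored = ds.dataset("test_dataset_indexed").to_table()
restored = restored.sort_by("__index").drop(["__index"])

assert restored.equals(table)

This restores the order regardless of how batches get scrambled on write, at the cost of storing one extra column.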
cc bkietz
Issue Links
- is duplicated by
  - ARROW-16506 Pyarrow 8.0.0 write_dataset writes data in different order with use_threads=True (Closed)
- requires
  - ARROW-17762 [C++] Add ordering information to exec batches (Open)