Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-10883

[C++][Dataset] Preserve order when writing dataset

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • 11.0.0
    • C++
    • None

    Description

      Currently, when writing a dataset, e.g. from a table consisting of a set of record batches, there is no guarantee that the row order is preserved when reading the dataset.

      Small code example:

      In [1]: import pyarrow.dataset as ds
      
      In [2]: table = pa.table({"a": range(10)})
      
      In [3]: table.to_pandas()
      Out[3]: 
         a
      0  0
      1  1
      2  2
      3  3
      4  4
      5  5
      6  6
      7  7
      8  8
      9  9
      
      In [4]: batches = table.to_batches(max_chunksize=2)
      
      In [5]: ds.write_dataset(batches, "test_dataset_order", format="parquet")
      
      In [6]: ds.dataset("test_dataset_order").to_table().to_pandas()
      Out[6]: 
         a
      0  4
      1  5
      2  8
      3  9
      4  6
      5  7
      6  2
      7  3
      8  0
      9  1
      

      Although this might seem normal in SQL world, typical dataframe users (R, pandas/dask, etc) will expect a preserved row order.
      Some applications might also rely on this, eg with dask you can have a sorted index column ("divisions" between the partitions) that would get lost this way (note, the dask parquet writer itself doesn't use pyarrow.dataset.write_dataset so isn't impacted by this.)

      Some discussion about this started in https://github.com/apache/arrow/pull/8305 (ARROW-9782), which changed to write all fragments to a single file instead of a file per fragment.

      I am not fully sure what the best way to solve this, but IMO at least having the option to preserve the order would be good.

      cc bkietz

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jorisvandenbossche Joris Van den Bossche
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated: