  1. Apache Arrow
  2. ARROW-7702

[C++][Dataset] Provide (optional) deterministic order of batches

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: C++, Python

    Description

      Example with Python:

      import pyarrow as pa
      import pyarrow.parquet as pq
      
      table = pa.table({'a': range(12)}) 
      pq.write_table(table, "test_chunks.parquet", chunk_size=3) 
      
      # reading with dataset
      import pyarrow.dataset as ds
      ds.dataset("test_chunks.parquet").to_table().to_pandas()
      

      This gives a non-deterministic result: the order in which the row groups of the Parquet file are returned can differ between calls:

      In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
      Out[25]: 
           a
      0    0
      1    1
      2    2
      3    3
      4    4
      5    5
      6    6
      7    7
      8    8
      9    9
      10  10
      11  11
      
      In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
      Out[26]: 
           a
      0    0
      1    1
      2    2
      3    3
      4    8
      5    9
      6   10
      7   11
      8    4
      9    5
      10   6
      11   7
      
      

            People

              Assignee: fsaintjacques Francois Saint-Jacques
              Reporter: jorisvandenbossche Joris Van den Bossche