Apache Arrow / ARROW-16506

Pyarrow 8.0.0 write_dataset writes data in different order with use_threads=True

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate

    Description

      In the latest release (8.0.0), the following snippet appears to write the rows within each partition in a different order when use_threads=True than when use_threads=False.

      With pyarrow 7.0.0, the same snippet writes the rows in the same order regardless of the use_threads setting.

       

      import itertools
      
      import numpy as np
      import pandas as pd  # required: the snippet uses pd.DataFrame, pd.concat and pd.read_parquet
      import pyarrow as pa
      import pyarrow.dataset as ds
      
      n_rows, n_cols = 100_000, 20
      
      def create_dataframe(color, year):
          arr = np.random.randn(n_rows, n_cols)
          df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
          df["color"] = color
          df["year"] = year
          df["id"] = np.arange(len(df))
          return df
      
      
      partitions = ["red", "green", "blue"]
      years = [2011, 2012, 2013]
      dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
      df = pd.concat(dataframes)
      
      table = pa.Table.from_pandas(df=df)
      
      ds.write_dataset(
          table,
          "./test",
          format="parquet",
          max_rows_per_group=1_000_000,
          min_rows_per_group=1_000_000,
          existing_data_behavior="overwrite_or_ignore",
          partitioning=ds.partitioning(pa.schema([
              ("color", pa.string()),
              ("year", pa.int64())
          ]), flavor="hive"),
          use_threads=True,
      )
      
      # Read one partition back and inspect the first ids; in the input they start at 0, 1, 2, ...
      df_read = pd.read_parquet("./test/color=blue/year=2012")
      df_read.head()[["id"]]
      
      

       

      Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.
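
The ordering difference described above can be checked programmatically rather than by inspecting `head()`. The following is a minimal sketch, not part of the original report: the helper names `is_non_decreasing` and `written_ids`, the smaller single-key partitioning, and the temporary output directory are all assumptions made for illustration. It relies only on the fact that the `id` column is ascending in the input, so a partition read back with non-ascending ids indicates that rows were reordered on write. Whether this reduced dataset triggers the reordering on a given machine is not guaranteed.

```python
import tempfile
from pathlib import Path


def is_non_decreasing(values):
    """Return True if every value is >= its predecessor."""
    return all(a <= b for a, b in zip(values, values[1:]))


def written_ids(use_threads, n_rows=100_000):
    """Write a small hive-partitioned dataset and return the ids read back
    from the color=blue partition (hypothetical helper for illustration)."""
    # Imported here so the pure-Python helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    colors = ["red", "green", "blue"]
    table = pa.table({
        "color": [c for c in colors for _ in range(n_rows)],
        "id": list(range(n_rows)) * len(colors),  # ascending within each color
    })
    with tempfile.TemporaryDirectory() as tmp:
        ds.write_dataset(
            table,
            tmp,
            format="parquet",
            partitioning=ds.partitioning(
                pa.schema([("color", pa.string())]), flavor="hive"
            ),
            use_threads=use_threads,
        )
        ids = []
        # Read each file of the partition directly, in file-name order.
        for path in sorted(Path(tmp, "color=blue").glob("*.parquet")):
            ids.extend(pq.read_table(str(path), columns=["id"])["id"].to_pylist())
    return ids
```

Comparing `is_non_decreasing(written_ids(False))` with `is_non_decreasing(written_ids(True))` on both pyarrow 7.0.0 and 8.0.0 would show whether the write order depends on `use_threads` in a given environment.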


            People

              Assignee: Unassigned
              Reporter: Daniel Friar (dannyfri22)
