Apache Arrow / ARROW-14772

[Python] unexpected content after groupby on a dataframe restored from partitioned parquet with filters


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version: 6.0.1
    • Fix Version: None
    • Components: Parquet, Python

    Description

      While experimenting with partitioned dataset persistence in Parquet, I stumbled upon an interesting feature (or bug?): after restoring only a single partition and applying a groupby, I suddenly get all the filtered-out rows back in the dataframe.

      The following code demonstrates the issue:

      import numpy as np
      import pandas as pd  # 1.3.4
      import pyarrow as pa  # 6.0.1 (used via engine='pyarrow')
      import random
      import shutil
      import string
      import tempfile
      
      from datetime import datetime, timedelta
      
      if __name__ == '__main__':
          # 1. generate random data frame
          day_count = 5
          data_length = 10
      
          numpy_random_gen = np.random.default_rng()
          label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
          partial_dfs = []
      
          start_date = datetime.today().date() - timedelta(days=day_count)
          for date in (start_date + timedelta(n) for n in range(day_count)):
              date_array = pd.to_datetime(np.full(data_length, date)).date
      
              label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])
      
              value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
      
              partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))
      
          df = pd.concat(partial_dfs, ignore_index=True)
          print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")
      
          # 2. persist data frame partitioned by date
          dataset_dir = tempfile.mkdtemp()
      
          df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])
      
          # 3. restore from parquet partitioned dataset
          restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', filters=[
              ('date', '=', str(start_date))], use_legacy_dataset=False)
          print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")
      
          group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
          print(group_by_df)
      
          shutil.rmtree(dataset_dir) 

      The script correctly reports five unique dates after generating the random dataframe, and only one after reading back from Parquet:

      Unique dates after restore:
      0    2021-11-13
      Name: date, dtype: category
      Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']

      Note, however, that the dtype still lists five categories. When I subsequently perform a groupby, all the dates that were filtered out at read time miraculously reappear:

          group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
          print(group_by_df)
      

      With the following output:

                date     label  val_sum
      0   2021-11-13  04LOXJCH      494
      1   2021-11-13  4QOZ321D      819
      2   2021-11-13  GG6YO5FS      394
      3   2021-11-13  J7ZD3LDS      203
      4   2021-11-13  TFVIXE6L      164
      5   2021-11-14  04LOXJCH        0
      6   2021-11-14  4QOZ321D        0
      7   2021-11-14  GG6YO5FS        0
      8   2021-11-14  J7ZD3LDS        0
      9   2021-11-14  TFVIXE6L        0
      10  2021-11-15  04LOXJCH        0
      11  2021-11-15  4QOZ321D        0
      12  2021-11-15  GG6YO5FS        0
      13  2021-11-15  J7ZD3LDS        0
      14  2021-11-15  TFVIXE6L        0
      15  2021-11-16  04LOXJCH        0
      16  2021-11-16  4QOZ321D        0
      17  2021-11-16  GG6YO5FS        0
      18  2021-11-16  J7ZD3LDS        0
      19  2021-11-16  TFVIXE6L        0
      20  2021-11-17  04LOXJCH        0
      21  2021-11-17  4QOZ321D        0
      22  2021-11-17  GG6YO5FS        0
      23  2021-11-17  J7ZD3LDS        0
      24  2021-11-17  TFVIXE6L        0

      Perhaps I am doing something incorrectly within the read_parquet call, but my expectation would be for the filtered data to simply be gone after the read operation.
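
      Since the issue was resolved as "Not A Problem", the likely explanation (my assumption, based on standard pandas semantics rather than anything stated in the ticket) is that partition columns are restored as pandas Categoricals whose categories still list every partition value, and that groupby with observed=False (the pandas 1.x default) emits one group per category combination, including the empty ones. A minimal pandas-only sketch of the behavior and a workaround:

```python
import pandas as pd

# Simulate what read_parquet returns: the partition column comes back as
# a Categorical that still carries every partition value as a category,
# even though only one date survived the filter.
restored = pd.DataFrame({
    'date': pd.Categorical(
        ['2021-11-13'] * 3,
        categories=['2021-11-13', '2021-11-14', '2021-11-15']),
    'value': [100, 200, 300],
})

# With observed=False (the pandas 1.x default), groupby produces one
# group per category, so the absent dates appear with zero-filled sums.
all_groups = restored.groupby('date', observed=False)['value'].sum()
print(len(all_groups))  # 3 groups, two of them zero-filled

# observed=True restricts the result to categories actually present.
present_only = restored.groupby('date', observed=True)['value'].sum()
print(len(present_only))  # 1 group
```

      Passing observed=True to groupby, or calling restored_df['date'].cat.remove_unused_categories() before grouping, keeps only the partitions that actually survived the filter.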

            People

              Assignee: Unassigned
              Reporter: Vadim Mironov (vadik_mironov)
