Apache Arrow / ARROW-7706

[Python] saving a dataframe to the same partitioned location silently doubles the data


Details

    Description

      When a user saves a dataframe:

      df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')
      

      it will create sub-directories named "col_a=val1", "col_a=val2", etc. in /tmp/table. Each of them contains one (or more) Parquet files with random filenames.

      If the user runs the same command again, the code reuses the existing sub-directories, but writes new files with different (random) filenames alongside the old ones. As a result, any data loaded from this directory will be wrong: each row will be present twice.

      For example:

      df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # second time

      df2 = pd.read_parquet('/tmp/table', engine='pyarrow')
      assert len(df1) == len(df2)  # raises AssertionError

      This is a subtle corruption of the data that can easily pass unnoticed.

       

      I would expect the code to prevent the user from using a non-empty destination as a partitioned target. An overwrite flag could also be useful.
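      The requested safeguard could be sketched as a small pre-write check. This is an illustrative, hypothetical helper (the name `assert_empty_partition_target` and its `overwrite` flag are not part of pandas or pyarrow): it refuses to write into a non-empty dataset directory unless the caller explicitly opts in.

      ```python
      import shutil
      from pathlib import Path

      def assert_empty_partition_target(path, overwrite=False):
          """Refuse to write a partitioned dataset into a non-empty directory.

          Hypothetical helper illustrating the safeguard requested above.
          With overwrite=True, the existing contents are removed first so
          stale partition files cannot be double-counted on the next read.
          """
          target = Path(path)
          if target.exists() and any(target.iterdir()):
              if not overwrite:
                  raise FileExistsError(
                      f"{target} is not empty; pass overwrite=True to replace it"
                  )
              # Explicit opt-in: drop the old dataset before rewriting.
              shutil.rmtree(target)
          target.mkdir(parents=True, exist_ok=True)
      ```

      Calling this before df.to_parquet(...) would turn the silent doubling into a loud error.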


People

  Assignee: Unassigned
  Reporter: Tsvika Shapira (tsvikas)
  Votes: 1
  Watchers: 6
