Apache Arrow / ARROW-6339

[Python][C++] Rowgroup statistics for pd.NaT array ill defined

Details

    Description

      When an array containing only NaT values is written to Parquet, the row group statistics are corrupt: reading them either returns random values or raises an integer out-of-bounds exception.

      import io
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      df = pd.DataFrame({"t": pd.Series([pd.NaT], dtype="datetime64[ns]")})
      buf = pa.BufferOutputStream()
      pq.write_table(pa.Table.from_pandas(df), buf, version="2.0")
      buf = io.BytesIO(buf.getvalue().to_pybytes())
      parquet_file = pq.ParquetFile(buf)
      # Asserting behaviour is difficult since it is random and the state is ill defined. 
      # After a few iterations an exception is raised.
      while True:
          parquet_file.metadata.row_group(0).column(0).statistics.max
      

          People

            uwe Uwe Korn
            fjetter Florian Jetter

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 2h 40m