Parquet / PARQUET-1405

[C++] 'Couldn't deserialize thrift' error when reading large binary column


Details

    Description

      We've run into issues reading Parquet files that contain long binary columns (UTF-8 strings). In particular, we hit the problem while generating WKT representations of polygons roughly 34 million characters long.

      The attached example generates a DataFrame with one record and one column containing a random string of 10^7 characters.
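
      A minimal sketch along the lines of the attached example, assuming pandas with the pyarrow engine installed; the column name 'wkt' and the use of random ASCII letters are illustrative, and parquet-issue-example.py may differ in detail:

      import random
      import string

      import pandas as pd

      # One record, one column holding a ~10 MB random string.
      big_string = ''.join(random.choices(string.ascii_letters, k=10**7))
      df = pd.DataFrame({'wkt': [big_string]})

      df.to_parquet('test.parquet')                 # write succeeds
      df_read_in = pd.read_parquet('test.parquet')  # raises ArrowIOError on read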

      Pandas (using the default pyarrow engine) writes the file successfully but fails when reading it back:

      ---------------------------------------------------------------------------
      ArrowIOError Traceback (most recent call last)
      <ipython-input-25-25d21204cbad> in <module>()
      ----> 1 df_read_in = pd.read_parquet('test.parquet')
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
      286 
      287 impl = get_engine(engine)
      --> 288 return impl.read(path, columns=columns, **kwargs)
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
      129 kwargs['use_pandas_metadata'] = True
      130 result = self.api.parquet.read_table(path, columns=columns,
      --> 131 **kwargs).to_pandas()
      132 if should_close:
      133 try:
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
      1044 fs = _get_fs_from_path(source)
      1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
      -> 1046 use_pandas_metadata=use_pandas_metadata)
      1047 
      1048 pf = ParquetFile(source, metadata=metadata)
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
      175 filesystem=self)
      176 return dataset.read(columns=columns, nthreads=nthreads,
      --> 177 use_pandas_metadata=use_pandas_metadata)
      178 
      179 def open(self, path, mode='rb'):
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
      896 partitions=self.partitions,
      897 open_file_func=open_file,
      --> 898 use_pandas_metadata=use_pandas_metadata)
      899 tables.append(table)
      900 
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
      459 table = reader.read_row_group(self.row_group, **options)
      460 else:
      --> 461 table = reader.read(**options)
      462 
      463 if len(self.partition_keys) > 0:
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
      150 columns, use_pandas_metadata=use_pandas_metadata)
      151 return self.reader.read_all(column_indices=column_indices,
      --> 152 nthreads=nthreads)
      153 
      154 def scan_contents(self, columns=None, batch_size=65536):
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      ArrowIOError: Couldn't deserialize thrift: No more data to read.
      Deserializing page header failed.
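
      The same failure can be reproduced without going through pandas by calling pyarrow directly (a sketch, assuming the file written above; read_table is the call that appears in the traceback):

      import pyarrow.parquet as pq

      table = pq.read_table('test.parquet')  # fails with the same "Couldn't deserialize thrift" ArrowIOError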
      


      Attachments

        1. parquet-issue-example.py (0.3 kB, Jeremy Heffner)

          People

            Assignee: Deepak Majeti (mdeepak)
            Reporter: Jeremy Heffner (jheffnerrseg)
            Votes: 2
            Watchers: 4

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 3h