Parquet / PARQUET-1405

[C++] 'Couldn't deserialize thrift' error when reading large binary column


Details

    Description

      We've run into issues reading Parquet files that contain long binary columns (UTF-8 strings). In particular, we hit the problem while generating WKT representations of polygons roughly 34 million characters long.

      The attached example generates a DataFrame with one record and one column containing a random string of 10^7 characters.
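
      A minimal sketch along the lines of the attached example, assuming pandas with the pyarrow engine installed; the column name 'wkt' and the use of random ASCII letters are illustrative, and parquet-issue-example.py may differ in detail:

      import random
      import string

      import pandas as pd

      # One record, one column holding a ~10 MB random string.
      big_string = ''.join(random.choices(string.ascii_letters, k=10**7))
      df = pd.DataFrame({'wkt': [big_string]})

      df.to_parquet('test.parquet')                 # write succeeds
      df_read_in = pd.read_parquet('test.parquet')  # raises ArrowIOError on read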

      Pandas (using the default pyarrow engine) writes the file successfully but fails when reading it back:

      ---------------------------------------------------------------------------
      ArrowIOError Traceback (most recent call last)
      <ipython-input-25-25d21204cbad> in <module>()
      ----> 1 df_read_in = pd.read_parquet('test.parquet')
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
      286 
      287 impl = get_engine(engine)
      --> 288 return impl.read(path, columns=columns, **kwargs)
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
      129 kwargs['use_pandas_metadata'] = True
      130 result = self.api.parquet.read_table(path, columns=columns,
      --> 131 **kwargs).to_pandas()
      132 if should_close:
      133 try:
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
      1044 fs = _get_fs_from_path(source)
      1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
      -> 1046 use_pandas_metadata=use_pandas_metadata)
      1047 
      1048 pf = ParquetFile(source, metadata=metadata)
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
      175 filesystem=self)
      176 return dataset.read(columns=columns, nthreads=nthreads,
      --> 177 use_pandas_metadata=use_pandas_metadata)
      178 
      179 def open(self, path, mode='rb'):
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
      896 partitions=self.partitions,
      897 open_file_func=open_file,
      --> 898 use_pandas_metadata=use_pandas_metadata)
      899 tables.append(table)
      900 
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
      459 table = reader.read_row_group(self.row_group, **options)
      460 else:
      --> 461 table = reader.read(**options)
      462 
      463 if len(self.partition_keys) > 0:
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
      150 columns, use_pandas_metadata=use_pandas_metadata)
      151 return self.reader.read_all(column_indices=column_indices,
      --> 152 nthreads=nthreads)
      153 
      154 def scan_contents(self, columns=None, batch_size=65536):
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
      
      ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      ArrowIOError: Couldn't deserialize thrift: No more data to read.
      Deserializing page header failed.
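
      The same failure can be reproduced without going through pandas by calling pyarrow directly (a sketch, assuming the file written above; read_table is the call that appears in the traceback):

      import pyarrow.parquet as pq

      table = pq.read_table('test.parquet')  # fails with the same "Couldn't deserialize thrift" ArrowIOError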
      


      Attachments

        1. parquet-issue-example.py (0.3 kB, Jeremy Heffner)

          People

            Assignee: Deepak Majeti (mdeepak)
            Reporter: Jeremy Heffner (jheffnerrseg)
            Votes: 2
            Watchers: 4

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 3h