ARROW-7867

[Python] ArrowIOError: Invalid Parquet file size is 0 bytes on reading from S3


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.15.1, 0.16.0
    • Fix Version/s: None
    • Component/s: Python

    Description

      I'm not sure if this issue belongs here or to the s3fs library.

      The error occurs when reading a partitioned Parquet dataset from S3, in the case where the "root folder" of the dataset was created manually before the Parquet data was written there.

      Steps to reproduce:

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      import s3fs

      fs = s3fs.S3FileSystem()
      df = pd.DataFrame({'col': [1, 2, 3]})  # any DataFrame

      # 1. Create "folder" s3://bucket.name/data.parquet manually, e.g. in the Cyberduck app

      # 2. Write
      table = pa.Table.from_pandas(df)
      pq.write_to_dataset(table, 's3://bucket.name/data.parquet',
                          partition_cols=[], filesystem=fs)

      # 3. Read
      pq.read_table('s3://bucket.name/data.parquet', filesystem=fs)
      # ArrowIOError: Invalid Parquet file size is 0 bytes

      When the table is partitioned by a non-empty set of columns, the error instead reads "ValueError: Found files in an intermediate directory"; a sketch of that variant follows below.
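
      For reference, a sketch of that partitioned variant (reusing the imports and table from the snippet above; 'col' stands in for any real column of df):

      # Same setup as above, but with a non-empty set of partition columns
      pq.write_to_dataset(table, 's3://bucket.name/data.parquet',
                          partition_cols=['col'], filesystem=fs)
      pq.read_table('s3://bucket.name/data.parquet', filesystem=fs)
      # ValueError: Found files in an intermediate directory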

      This is likely because S3 has no "folders" per se: various tools mimic creating an empty folder by writing an empty (zero-size) object under the folder's key. The Parquet reader then confuses this zero-size object with the actual contents of a Parquet file.
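
      The placeholder can be seen in a plain s3fs listing (a sketch: the part file name and the listing output shown in the comments are illustrative assumptions, not captured output):

      import s3fs

      fs = s3fs.S3FileSystem()
      # detail=True yields one dict per object; the manually created "folder"
      # shows up as a zero-byte object next to the real Parquet file(s).
      for entry in fs.ls('bucket.name/data.parquet', detail=True):
          print(entry['name'], entry['size'])
      # bucket.name/data.parquet                 0     <- empty placeholder object
      # bucket.name/data.parquet/part-0.parquet  4321  <- actual data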

      At the same time, the s3fs library correctly identifies the key as a directory:

      s3fs.S3FileSystem().isdir('s3://bucket.name/data.parquet')  # Returns True
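
      A possible workaround until this is handled in Arrow (a sketch, not an official fix) is to list only the non-empty objects under the key and read them explicitly, so the zero-byte placeholder is skipped:

      import s3fs
      import pyarrow.parquet as pq

      fs = s3fs.S3FileSystem()
      root = 'bucket.name/data.parquet'

      # Keep only non-empty objects; the zero-byte entry is the folder placeholder.
      files = [path for path in fs.ls(root) if fs.info(path)['size'] > 0]

      table = pq.ParquetDataset(files, filesystem=fs).read()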
      

       

          People

            Assignee: Unassigned
            Reporter: Filimonov Vladimir
