Details
- Type: Improvement
- Priority: Major
- Status: Resolved
- Resolution: Fixed
Description
In customer deployments it has been observed that, sporadically, a Spark "insert overwrite" produces an invalid / corrupted Parquet file. No exceptions are raised, the writing tasks commit successfully, and shutdown is graceful once all tasks are done.
However, when the written files are read back, data corruption occurs; analysis shows the error "Expected 15356 uncompressed bytes but got 15108", a deficit of 248 bytes.
Given the low frequency of occurrence, the suspicion is that the output stream is closed before the buffered data has been fully flushed.
The suggestion is therefore to add a flush call between writing the footer and closing the stream, in the end() method of
org.apache.parquet.hadoop.ParquetFileWriter
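As a rough sketch of the proposed pattern, the following self-contained Java example illustrates flushing the output stream after the last (footer) bytes are written and before close(). The class name, file name, and placeholder payloads are illustrative only and are not the actual ParquetFileWriter code; the real end() method writes the footer and metadata to its underlying Hadoop output stream before closing it.
{code:java}
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class FlushBeforeCloseSketch {
  public static void main(String[] args) throws IOException {
    // Buffered stream standing in for the writer's underlying output stream.
    OutputStream out = new BufferedOutputStream(new FileOutputStream("sketch-output.bin"));
    try {
      // Placeholder bytes standing in for the row groups and the footer + magic bytes.
      out.write("...row groups...".getBytes(StandardCharsets.UTF_8));
      out.write("...footer PAR1".getBytes(StandardCharsets.UTF_8));
      // Proposed step: explicitly flush buffered bytes to the underlying stream
      // before close(), so nothing is lost if close() does not flush fully.
      out.flush();
    } finally {
      out.close();
    }
  }
}
{code}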