Details
- Type: Improvement
- Priority: Major
- Status: Resolved
- Resolution: Fixed
Description
In customer deployments it has been observed that, sporadically, a Spark "insert overwrite" produces an invalid / corrupted Parquet file. No exceptions are raised, the writing tasks commit successfully, and shutdown is graceful once all tasks are done.
However, when the written files are read back, data corruption occurs; analysis shows the error "Expected 15356 uncompressed bytes but got 15108", a deficit of 248 bytes.
Given the low frequency of occurrence, the suspicion is that the output stream is closed before the buffered data has been fully flushed.
The suggestion is therefore to add a flush call between writing the footer and closing the stream, in the end() method of
org.apache.parquet.hadoop.ParquetFileWriter
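As a rough sketch of the proposed pattern, the following self-contained Java example illustrates flushing the output stream after the last (footer) bytes are written and before close(). The class name, file name, and placeholder payloads are illustrative only and are not the actual ParquetFileWriter code; the real end() method writes the footer and metadata to its underlying Hadoop output stream before closing it.
{code:java}
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class FlushBeforeCloseSketch {
  public static void main(String[] args) throws IOException {
    // Buffered stream standing in for the writer's underlying output stream.
    OutputStream out = new BufferedOutputStream(new FileOutputStream("sketch-output.bin"));
    try {
      // Placeholder bytes standing in for the row groups and the footer + magic bytes.
      out.write("...row groups...".getBytes(StandardCharsets.UTF_8));
      out.write("...footer PAR1".getBytes(StandardCharsets.UTF_8));
      // Proposed step: explicitly flush buffered bytes to the underlying stream
      // before close(), so nothing is lost if close() does not flush fully.
      out.flush();
    } finally {
      out.close();
    }
  }
}
{code}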