Parquet / PARQUET-2454

Invoking flush before closing the output stream in ParquetFileWriter

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: parquet-mr

    Description

      It has been observed in customer deployments that, sporadically, an "Insert overwrite" run through Spark produces an invalid / corrupted Parquet file. No exceptions are thrown, the writing tasks are committed successfully, and shutdown is graceful once all the tasks are done.

      However, when the written files are read back, data corruption occurs; analysis shows the error "Expected 15356 uncompressed bytes but got 15108", a deficit of 248 bytes.

      Given the low frequency of occurrence, the suspicion is that the output stream is closed before the buffered data has been fully flushed.

      The suggestion is therefore to invoke flush(), between writing the footer and closing the stream, in the end() method of org.apache.parquet.hadoop.ParquetFileWriter, as sketched below.
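
      A minimal sketch of the proposed ordering follows; it is illustrative only, not the actual parquet-mr source. The class FooterWriterSketch, its end() signature, and the serializedFooter parameter are hypothetical names used solely to show the explicit flush() between writing the footer and closing the stream.

        import java.io.IOException;
        import java.io.OutputStream;

        // Illustrative sketch only; not the actual ParquetFileWriter code.
        class FooterWriterSketch {
          private final OutputStream out;

          FooterWriterSketch(OutputStream out) {
            this.out = out;
          }

          // Mirrors the proposed ordering for ParquetFileWriter.end():
          // write the footer, flush explicitly, then close.
          void end(byte[] serializedFooter) throws IOException {
            out.write(serializedFooter); // footer bytes already serialized by the caller
            out.flush();                 // proposed addition: push buffered bytes to the underlying stream
            out.close();                 // close afterwards; relying on close() alone is suspected to drop data
          }
        }

      The change only removes the dependence on close() alone to drain the writer's buffer; whether the underlying Hadoop output stream treats flush() as a durable sync is a separate question.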

People

    • Assignee: ashahid7 Asif
    • Reporter: ashahid7 Asif
    • Votes: 0
    • Watchers: 2
