Parquet / PARQUET-2342

Parquet writer produced a corrupted file due to page value count overflow


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: parquet-mr
    • Labels: None
    • Flags: Patch

    Description

      The Parquet writer checks only the number of rows and the page size when deciding whether the content being written still fits in a single page.

      For a nested column (e.g. array/map) with many nulls, it is possible to accumulate more than 2 billion values while staying under the default page-size and row-count thresholds (1 MB, 20,000 rows), which overflows the page's 32-bit value count.
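
      As a minimal, self-contained sketch of that failure mode (this is not the actual parquet-mr flush logic, and the per-row buffered-byte estimate is an assumption made purely for illustration): neither default threshold bounds the number of values in a page, so a column of large null arrays can push a 32-bit value counter past Int.MaxValue before either threshold triggers a flush.

            // Illustrative only: mimics a flush check that looks at buffered
            // bytes and row count, but never at the number of values.
            object PageValueCountOverflowSketch {
              val PageSizeThresholdBytes = 1 << 20   // 1 MB default page size
              val RowCountThreshold      = 20000     // default row-count check

              def main(args: Array[String]): Unit = {
                val valuesPerRow = 110000L  // array_repeat(null, 110000) per row
                val bytesPerRow  = 40L      // assumption: null entries buffer only rep/def levels

                var rows = 0L
                var bufferedBytes = 0L
                var values = 0L             // what a 32-bit page value counter would have to hold

                while (bufferedBytes < PageSizeThresholdBytes && rows < RowCountThreshold) {
                  rows += 1
                  bufferedBytes += bytesPerRow
                  values += valuesPerRow
                }

                println(s"rows=$rows bufferedBytes=$bufferedBytes values=$values")
                println(s"value count overflows Int.MaxValue: ${values > Int.MaxValue}")
              }
            }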

       

      Repro using Spark:

            // 20,000 rows in a single partition; each row is an array of 110,000 nulls.
            val dir = "/tmp/anyrandomDirectory"

            // Parquet is Spark's default data source, so save() writes a Parquet file.
            spark.range(0, 20000, 1, 1)
              .selectExpr("array_repeat(cast(null as binary), 110000) as n")
              .write
              .mode("overwrite")
              .save(dir)

            val result = spark
              .sql(s"select * from parquet.`$dir` limit 1000")
              .collect() // This will break
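
      The write itself succeeds, but 20,000 rows × 110,000 elements is about 2.2 billion values, above Int.MaxValue (2,147,483,647); the overflowed page value count is what corrupts the file and makes the subsequent read break.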


          People

            Assignee: majdyz (Zamil Majdy)
            Reporter: majdyz (Zamil Majdy)
            Votes: 0
            Watchers: 2
