Spark / SPARK-26709

OptimizeMetadataOnlyQuery does not correctly handle files with zero records

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.1.3, 2.2.3, 2.3.2, 2.4.0
    • Fix Version/s: 2.3.3, 2.4.1, 3.0.0
    • Component/s: SQL

    Description

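      // Runs inside a Spark SQL test suite that provides withSQLConf, withTempPath, checkAnswer and spark.
      import org.apache.hadoop.fs.Path
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.functions.lit
      import org.apache.spark.sql.internal.SQLConf

      withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
        withTempPath { path =>
          val tabLocation = path.getAbsolutePath
          // Partition directory partCol1=3 under the table root.
          val partLocation = new Path(tabLocation, "partCol1=3")
          // Zero rows with a single column col1, so the partition holds no records.
          val df = spark.emptyDataFrame.select(lit(1).as("col1"))
          df.write.parquet(partLocation.toString)
          val readDF = spark.read.parquet(tabLocation)
          // The table has no rows, so both aggregates must be null.
          checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
          checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
        }
      }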
      

      OptimizeMetadataOnlyQuery has a correctness bug when handling files with zero records in partitioned tables. The test above fails in 2.4, which can write an empty file; the underlying issue in the read path also exists in 2.3, 2.2 and 2.1.
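
The mismatch described above can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the optimizer's actual code: `Partition`, `maxFromMetadata` and `maxFromScan` are hypothetical names introduced here. It assumes a metadata-only plan answers the aggregate from partition values alone, while a normal scan produces one partition value per stored row.

```scala
// Hypothetical model, not Spark code: a partition is its partition value
// plus the col1 values actually stored in its files.
final case class Partition(partCol1: Int, col1Rows: Seq[Int])

object MetadataOnlySketch {
  // Metadata-only answer: looks at partition values only and never opens a file.
  def maxFromMetadata(parts: Seq[Partition]): Option[Int] =
    parts.map(_.partCol1).reduceOption(_ max _)

  // Scan-based answer: emits partCol1 once per stored row,
  // so a partition with zero records contributes nothing.
  def maxFromScan(parts: Seq[Partition]): Option[Int] =
    parts.flatMap(p => p.col1Rows.map(_ => p.partCol1)).reduceOption(_ max _)

  def main(args: Array[String]): Unit = {
    // Same shape as the test: one partition, partCol1=3, containing no rows.
    val table = Seq(Partition(partCol1 = 3, col1Rows = Seq.empty))
    println(maxFromMetadata(table)) // Some(3)
    println(maxFromScan(table))     // None
  }
}
```

In this model `None` plays the role of SQL NULL. The scan-based result matches what the test expects (`Row(null)`), while the metadata-only result reports a value for a partition that holds no records.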

            People

              Assignee: Gengliang Wang
              Reporter: Xiao Li
