SPARK-25313

Fix regression in FileFormatWriter output schema

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.3.2, 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      In the following example:

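      // Run in spark-shell, where `spark` is predefined.
      val location = "/tmp/t"
      val df = spark.range(10).toDF("id")
      df.write.format("parquet").saveAsTable("tbl")
      spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
      // The LOCATION path must be a quoted string literal.
      spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet LOCATION '$location'")
      spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
      // Schema of the files actually written: the column is named id, not ID.
      println(spark.read.parquet(location).schema)
      // Reads through the table schema (ID), which no longer matches the files.
      spark.table("tbl2").show()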

      The column name in the written Parquet schema is id instead of ID, so the last query shows nothing from tbl2.
      With debug logging enabled, we can see that the output name changes from ID to id, and that the outputColumns in InsertIntoHadoopFsRelationCommand are then changed by RemoveRedundantAliases.

      To guarantee correctness, we should change the type of the output columns from `Seq[Attribute]` to `Seq[String]` so that the optimizer cannot replace their names.
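
      The idea behind the proposed change can be illustrated with a small toy model. This is only a sketch: `Attr`, `rewrite`, and `OutputNamesSketch` are hypothetical stand-ins invented for illustration, not Spark's Catalyst classes, and the real RemoveRedundantAliases logic is more involved. The sketch assumes that a rewrite matches attributes by expression ID, so a name stored on an attribute can be swapped for a differently cased one, while a name stored as a plain string cannot.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
object OutputNamesSketch {
  // An attribute is identified by its exprId; the name is a label carried along.
  final case class Attr(name: String, exprId: Long)

  // Simplified stand-in for an optimizer rewrite: each attribute is replaced
  // by the mapped attribute that has the same exprId, if there is one.
  def rewrite(attrs: Seq[Attr], replacements: Map[Long, Attr]): Seq[Attr] =
    attrs.map(a => replacements.getOrElse(a.exprId, a))

  // The view exposes the column as lowercase "id".
  val viewColumn: Attr = Attr("id", exprId = 1L)

  // The INSERT query asks for "ID": same exprId in this model, different label.
  val requested: Seq[Attr] = Seq(Attr("ID", exprId = 1L))

  // Before: output columns kept as attributes, so the rewrite changes the name.
  val namesFromAttributes: Seq[String] =
    rewrite(requested, Map(viewColumn.exprId -> viewColumn)).map(_.name)

  // After: output names captured as plain strings, which the rewrite never sees.
  val namesFromStrings: Seq[String] = requested.map(_.name)

  def main(args: Array[String]): Unit = {
    println(namesFromAttributes) // List(id)
    println(namesFromStrings)    // List(ID)
  }
}
```

      In this model the attribute-based list ends up with id while the string-based list keeps ID, which mirrors the mismatch between the written schema and the schema of tbl2 described above.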

            People

              Assignee: Gengliang Wang
              Reporter: Gengliang Wang