Spark / SPARK-46093

append to parquet file with column type changed corrupts file


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      from pyspark.sql.functions import *
      from pyspark.sql.types import *

      fnBad = "dbfs:/tmp/richard.gooding@os.uk/test_bad_parquet/f1"
      df = spark.createDataFrame([["aaaa"]]).select(col("_1").alias("aa"))
      df.printSchema()

      fmt = "parquet"
      # fmt = "delta"
      df.write.mode("overwrite").format(fmt).save(fnBad)
      df.show()

      df = df.withColumn("aa", struct(col("aa")))  # change type of column - error on load
      df.printSchema()
      df.show()
      # With fmt = "delta" the next line fails fast: "AnalysisException: Failed to
      # merge fields 'aa' and 'aa'. Failed to merge incompatible data types
      # StringType and StructType(StructField(aa,StringType,true))"
      df.write.mode("append").format(fmt).save(fnBad)
      # df.write.mode("append").option("mergeSchema", "true").format(fmt).save(fnBad)  # gives a different error, but only when the dataframe is read

      print(" — at df 2 — ")
      df2 = spark.read.format(fmt).load(fnBad)
      # df2 = spark.read.option("mergeSchema", "true").format(fmt).load(fnBad)
      df2.show()  # this will error
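
      A possible guard (not from the report; a minimal sketch, assuming a live
      SparkSession named spark as in the repro above) is to compare the incoming
      schema against what is already on disk before appending, since the parquet
      writer performs no such check on append. The helper safe_append below is
      hypothetical, not part of any Spark API:

      # Hypothetical helper: refuse to append when the incoming schema does not
      # exactly match the schema already stored at path.
      def safe_append(df, path, fmt="parquet"):
          existing = spark.read.format(fmt).load(path).schema
          if existing != df.schema:
              raise ValueError(
                  "schema mismatch: existing=%s incoming=%s"
                  % (existing.simpleString(), df.schema.simpleString())
              )
          df.write.mode("append").format(fmt).save(path)

      With this guard, the mismatched append in the repro would fail at write
      time instead of producing a dataset that only errors when read back.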

People

    • Assignee: Unassigned
    • Reporter: richard gooding (rich84t)
    • Votes: 0
    • Watchers: 1
