Spark / SPARK-46093

append to parquet file with column type changed corrupts file


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      from pyspark.sql.functions import *
      from pyspark.sql.types import *

      fnBad = "dbfs:/tmp/richard.gooding@os.uk/test_bad_parquet/f1"
      df = spark.createDataFrame([["aaaa"]]).select(col("_1").alias("aa"))
      df.printSchema()

      fmt = "parquet"
      # fmt = "delta"
      df.write.mode("overwrite").format(fmt).save(fnBad)
      df.show()

      df = df.withColumn("aa", struct(col("aa")))  # change type of column - error on load
      df.printSchema()
      df.show()
      # With fmt = "delta" the next line fails fast: "AnalysisException: Failed to
      # merge fields 'aa' and 'aa'. Failed to merge incompatible data types
      # StringType and StructType(StructField(aa,StringType,true))"
      df.write.mode("append").format(fmt).save(fnBad)
      # df.write.mode("append").option("mergeSchema", "true").format(fmt).save(fnBad)  # gives a different error, but only when the dataframe is read

      print(" — at df 2 — ")
      df2 = spark.read.format(fmt).load(fnBad)
      # df2 = spark.read.option("mergeSchema", "true").format(fmt).load(fnBad)
      df2.show()  # this will error
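
      A possible guard (not from the report; a minimal sketch, assuming a live
      SparkSession named spark as in the repro above) is to compare the incoming
      schema against what is already on disk before appending, since the parquet
      writer performs no such check on append. The helper safe_append below is
      hypothetical, not part of any Spark API:

      # Hypothetical helper: refuse to append when the incoming schema does not
      # exactly match the schema already stored at path.
      def safe_append(df, path, fmt="parquet"):
          existing = spark.read.format(fmt).load(path).schema
          if existing != df.schema:
              raise ValueError(
                  "schema mismatch: existing=%s incoming=%s"
                  % (existing.simpleString(), df.schema.simpleString())
              )
          df.write.mode("append").format(fmt).save(path)

      With this guard, the mismatched append in the repro would fail at write
      time instead of producing a dataset that only errors when read back.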

People

    • Assignee: Unassigned
    • Reporter: richard gooding (rich84t)
    • Votes: 0
    • Watchers: 1
