Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 2.4.8
Fix Version: None
Environment:
java.version = 1.8
spark.version = 2.4.8
hadoop.version = 3.1.3
File Output Committer Algorithm version is 2
FileOutputCommitter skip cleanup _temporary folders under output directory: false, ignore cleanup failures: false
Description
I have the following table structure, to which I want to write a DataFrame:
CREATE EXTERNAL TABLE `usl_rdm_idl_spark_stg.okogu_h`(
  `ctl_loading` bigint,
  `ctl_validfrom` timestamp,
  `end_dt` date,
  `okogu_accept_dt` date)
PARTITIONED BY (`p1day` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://FESS-DEV/data/usl/rdm_idl_spark/stg/okogu_h'
TBLPROPERTIES (
  'bucketing_version'='2',
  'spark.sql.partitionProvider'='catalog',
  'transient_lastDdlTime'='1654082666')
The final DataFrame has the same structure as the table above. The issue occurs when the column "p1day" (the table's partition column) contains only null values. When I try to write it with either option:
finalDF.write().mode(SaveMode.Append).partitionBy("p1day").parquet("somepath");
or
finalDF.write().mode(SaveMode.Append).insertInto(String.format("%s.%s", tgtSchema, tgtTable));
I get the following error:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.fs.FileAlreadyExistsException: /data/usl/rdm_idl_spark/stg/okogu_h/.hive-staging_hive_2022-06-01_16-59-37_442_6329951430234699240-1/-ext-10000/_temporary/0/_temporary/attempt_20220601165937_0116_m_000001_586/p1day=__HIVE_DEFAULT_PARTITION__/part-00001-05999af9-8a25-406e-a307-f97781547db2.c000 for client 10.106.105.11 already exists
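For context, a minimal standalone sketch of a job that writes a DataFrame whose partition column is entirely null (the session setup, schema, and output path here are my assumptions for illustration, not taken from the original job; whether this exact sketch reproduces the failure also depends on the committer configuration listed in the environment above):

```java
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class NullPartitionRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("null-partition-repro")
                .getOrCreate();

        // Build a DataFrame whose partition column is null for every row;
        // lit(null).cast("string") produces a typed all-null string column.
        Dataset<Row> df = spark.range(10)
                .withColumn("p1day", lit(null).cast("string"));

        // Every row falls into the __HIVE_DEFAULT_PARTITION__ partition,
        // matching the partition name seen in the stack trace above.
        df.write().mode(SaveMode.Append)
                .partitionBy("p1day")
                .parquet("/tmp/okogu_h_repro"); // hypothetical path

        spark.stop();
    }
}
```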
It works correctly for me only when I replace the null values in the "p1day" column with some non-null value (e.g. "1"):
finalDF.withColumn("p1day", lit("1"));
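Note that lit("1") overwrites the whole column. A variant of this workaround that replaces only the null values, keeping real partition values intact, could use the standard Spark SQL coalesce function (a sketch; the helper name is mine, and "1" is just the same placeholder value as above):

```java
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class NullPartitionWorkaround {
    // Replace nulls in the partition column with a placeholder so no row
    // lands in __HIVE_DEFAULT_PARTITION__; non-null values are untouched.
    static Dataset<Row> fillNullPartition(Dataset<Row> finalDF) {
        return finalDF.withColumn("p1day", coalesce(col("p1day"), lit("1")));
    }
}
```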
Is this a bug in the spark-sql code? I am using org.apache.spark:spark-sql_2.11:2.4.8.