SPARK-39379

FileAlreadyExistsException when writing a DF to a Hive table via insertInto() or directly via write().parquet()


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.8
    • Fix Version/s: None
    • Component/s: SQL
    • Environment:
      java.version = 1.8
      spark.version = 2.4.8
      hadoop.version = 3.1.3

      File Output Committer Algorithm version is 2

      FileOutputCommitter skip cleanup _temporary folders under output directory: false, ignore cleanup failures: false
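
      For context, a minimal sketch of how these committer settings would be passed to a Spark session (the keys are the standard Hadoop FileOutputCommitter ones, routed through the spark.hadoop.* prefix; the app name is illustrative):

      import org.apache.spark.sql.SparkSession;

      SparkSession spark = SparkSession.builder()
              .appName("committer-config-example") // illustrative name
              // Same settings as listed in the environment above:
              .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
              .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "false")
              .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "false")
              .enableHiveSupport()
              .getOrCreate();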

    Description

       

      I have the following table structure into which I want to write a DF:

      CREATE EXTERNAL TABLE `usl_rdm_idl_spark_stg.okogu_h`(
        `ctl_loading` bigint,
        `ctl_validfrom` timestamp,
        `end_dt` date,
        `okogu_accept_dt` date)
      PARTITIONED BY (
        `p1day` string)
      ROW FORMAT SERDE
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      STORED AS INPUTFORMAT
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
        'hdfs://FESS-DEV/data/usl/rdm_idl_spark/stg/okogu_h'
      TBLPROPERTIES (
        'bucketing_version'='2',
        'spark.sql.partitionProvider'='catalog',
        'transient_lastDdlTime'='1654082666')
      

      The final DF has the same structure as the table above. The issue happens when the attribute "p1day" (the table's partition column) contains only null values. So when I try to write it with either of these options

      finalDF.write().mode(SaveMode.Append).partitionBy("p1day").parquet("somepath");

       or

      finalDF.write().mode(SaveMode.Append).insertInto(String.format("%s.%s", tgtSchema, tgtTable));

      I get the following error:

      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.fs.FileAlreadyExistsException: /data/usl/rdm_idl_spark/stg/okogu_h/.hive-staging_hive_2022-06-01_16-59-37_442_6329951430234699240-1/-ext-10000/_temporary/0/_temporary/attempt_20220601165937_0116_m_000001_586/p1day=__HIVE_DEFAULT_PARTITION__/part-00001-05999af9-8a25-406e-a307-f97781547db2.c000 for client 10.106.105.11 already exists
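
      For reference, a minimal self-contained sketch of the scenario (the single-row DF, the app name, and the output path are illustrative; my real DF comes from upstream transformations):

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SaveMode;
      import org.apache.spark.sql.SparkSession;

      public class NullPartitionRepro {
          public static void main(String[] args) {
              SparkSession spark = SparkSession.builder()
                      .appName("null-partition-repro") // illustrative name
                      .enableHiveSupport()
                      .getOrCreate();

              // Every row has a null "p1day", so the whole write goes to the
              // __HIVE_DEFAULT_PARTITION__ partition.
              Dataset<Row> finalDF = spark.sql(
                      "SELECT CAST(1 AS BIGINT)       AS ctl_loading, "
                    + "       CAST(NULL AS TIMESTAMP) AS ctl_validfrom, "
                    + "       CAST(NULL AS DATE)      AS end_dt, "
                    + "       CAST(NULL AS DATE)      AS okogu_accept_dt, "
                    + "       CAST(NULL AS STRING)    AS p1day");

              // Either of the two write paths from the description:
              finalDF.write().mode(SaveMode.Append)
                      .partitionBy("p1day")
                      .parquet("/tmp/okogu_h_repro"); // illustrative path

              // finalDF.write().mode(SaveMode.Append)
              //         .insertInto("usl_rdm_idl_spark_stg.okogu_h");
          }
      }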

       

      For me it works correctly only when I replace the null values in the "p1day" column with some non-null value (e.g. "1"):

      finalDF = finalDF.withColumn("p1day", lit("1"));
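
      A variant of the same workaround that only substitutes the nulls and leaves real partition values intact (the replacement value "1" is just an example) would be something like:

      import static org.apache.spark.sql.functions.*;

      // Replace only null p1day values; non-null partition values are kept.
      finalDF = finalDF.withColumn("p1day", coalesce(col("p1day"), lit("1")));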

       

      Is this a bug in the spark-sql code? I use org.apache.spark:spark-sql_2.11:2.4.8

       

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Filimonov Valentin (filimonov1-ve)
            Votes: 0
            Watchers: 1
