SPARK-29299

Intermittently getting "Cannot create the managed table" error while creating a table from Spark 2.4


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

We are intermittently facing the error below in Spark 2.4 when saving a managed table.

      Error -
      pyspark.sql.utils.AnalysisException: u"Can not create the managed table('`hive_issue`.`table`'). The associated location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table') already exists.;"

Steps to reproduce--
1. Create a dataframe in Spark from mid-size data (a 30 MB CSV file)
2. Save the dataframe as a table
3. Terminate the session while the above operation is in progress

Note--
Session termination is just a way to reproduce this issue. In practice we hit it intermittently when running the same Spark jobs multiple times. We use EMRFS and HDFS on an EMR cluster and see the same issue on both file systems.
The only way we can fix it is by deleting the target folder where the table keeps its files, which is not an option for us: we need to keep historical information in the table, hence we write to it in APPEND mode (a sketch of this cleanup follows below).
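
A minimal sketch of the cleanup workaround described above, assuming the table location reported in the error message; the path and the standalone session here are illustrative, not part of our production job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical leftover table directory, taken from the error message above
table_path = "s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table"

# Delete the directory through the Hadoop FileSystem API (via the JVM gateway)
# so the next saveAsTable() run starts from an empty location
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path(table_path)
fs = path.getFileSystem(hadoop_conf)
if fs.exists(path):
    fs.delete(path, True)  # recursive delete of the leftover files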

Sample code--

from pyspark.sql import SparkSession

sc = SparkSession.builder.enableHiveSupport().getOrCreate()
df = sc.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
print("STARTED WRITING TO TABLE")
# Terminate the session with Ctrl+C after the df.write action below has started
df.write.mode("append").saveAsTable("hive_issue.table")
print("COMPLETED WRITING TO TABLE")

We went through the Spark 2.4 migration guide [1] and found that Spark no longer allows creating managed tables on a non-empty location.

1. What is the reason behind this change in Spark's behavior?
2. To us this looks like a breaking change: even when the "overwrite" mode is specified, Spark is unable to wipe out the existing data and create the table.
3. Is there any solution to this issue other than setting the "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag? (A sketch of this flag-based workaround follows below.)

      [1]
      https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
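
For reference, a minimal sketch of the flag-based workaround from question 3, assuming Spark 2.4; the flag has to be set on the session before the write that creates the table:

from pyspark.sql import SparkSession

# The legacy flag restores the pre-2.4 behavior of allowing a managed
# table to be created on a non-empty location
spark = (SparkSession.builder
    .enableHiveSupport()
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
    .getOrCreate())

df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
df.write.mode("append").saveAsTable("hive_issue.table")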

       


People

  Assignee: Unassigned
  Reporter: Abhijeet (abhijeet_bedagkar)
  Votes: 0
  Watchers: 2
