SPARK-29299

Intermittently getting "Cannot create the managed table" error while creating a table from Spark 2.4


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

We are intermittently facing the error below in Spark 2.4 when saving a managed table.

      Error -
      pyspark.sql.utils.AnalysisException: u"Can not create the managed table('`hive_issue`.`table`'). The associated location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table') already exists.;"

Steps to reproduce--
1. Create a dataframe in Spark from mid-size data (a 30 MB CSV file)
2. Save the dataframe as a table
3. Terminate the session while the above operation is in progress

Note--
Session termination is just a way to reproduce this issue. In practice we hit it intermittently when running the same Spark jobs multiple times. We use EMRFS and HDFS on an EMR cluster and see the same issue on both file systems.
The only way we can fix it is by deleting the target folder where the table keeps its files, which is not an option for us: we need to keep historical information in the table, hence we write to it in APPEND mode (a sketch of this cleanup follows below).
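
A minimal sketch of the cleanup workaround described above, assuming the table location reported in the error message; the path and the standalone session here are illustrative, not part of our production job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical leftover table directory, taken from the error message above
table_path = "s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table"

# Delete the directory through the Hadoop FileSystem API (via the JVM gateway)
# so the next saveAsTable() run starts from an empty location
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path(table_path)
fs = path.getFileSystem(hadoop_conf)
if fs.exists(path):
    fs.delete(path, True)  # recursive delete of the leftover files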

Sample code--

from pyspark.sql import SparkSession

sc = SparkSession.builder.enableHiveSupport().getOrCreate()
df = sc.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
print("STARTED WRITING TO TABLE")
# Terminate the session with Ctrl+C after the df.write action below has started
df.write.mode("append").saveAsTable("hive_issue.table")
print("COMPLETED WRITING TO TABLE")

We went through the Spark 2.4 migration guide [1] and found that Spark no longer allows creating managed tables on a non-empty location.

1. What is the reason behind this change in Spark's behavior?
2. To us this looks like a breaking change: even when the "overwrite" mode is specified, Spark is unable to wipe out the existing data and create the table.
3. Is there any solution to this issue other than setting the "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag? (A sketch of this flag-based workaround follows below.)

      [1]
      https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
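
For reference, a minimal sketch of the flag-based workaround from question 3, assuming Spark 2.4; the flag has to be set on the session before the write that creates the table:

from pyspark.sql import SparkSession

# The legacy flag restores the pre-2.4 behavior of allowing a managed
# table to be created on a non-empty location
spark = (SparkSession.builder
    .enableHiveSupport()
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
    .getOrCreate())

df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
df.write.mode("append").saveAsTable("hive_issue.table")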

       


People

  Assignee: Unassigned
  Reporter: Abhijeet (abhijeet_bedagkar)
  Votes: 0
  Watchers: 2
