Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version: 2.3.0
- Fix Version: None
Description
We use .saveAsTable with dynamic partitioning as our only way to write data to S3 from Spark.
When only one partition column is defined for a table, .saveAsTable behaves as expected:
- with Overwrite mode, it creates the table if it doesn't exist and writes the data
- with Append mode, it appends to the given partition
- with Overwrite mode, if the table exists, it overwrites the partition
If two partition columns are used, however, the directory is created on S3 with a _SUCCESS file, but no data is actually written.
Our workaround is to check whether the table exists and, if it does not, set the partition overwrite mode back to static before running saveAsTable. The write that triggers the issue:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write
  .mode("overwrite")
  .partitionBy("year", "month")
  .option("path", "s3://hbc-data-warehouse/integration/users_test")
  .saveAsTable("users_test")
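A minimal sketch of the workaround's decision logic, assuming a helper that picks the overwrite mode based on whether the target table already exists (the helper name is ours, not part of Spark):

```python
def choose_partition_overwrite_mode(table_exists: bool) -> str:
    # Workaround logic from the report: dynamic partition overwrite only
    # works once the table already exists, so fall back to static mode
    # for the initial, table-creating write.
    return "dynamic" if table_exists else "static"
```

Hypothetical usage with a live SparkSession (spark.catalog.tableExists is available in the Scala Catalog API; check your PySpark version before relying on it):

```python
# exists = spark.catalog.tableExists("users_test")
# spark.conf.set("spark.sql.sources.partitionOverwriteMode",
#                choose_partition_overwrite_mode(exists))
# df.write.mode("overwrite").partitionBy("year", "month") \
#   .option("path", "s3://hbc-data-warehouse/integration/users_test") \
#   .saveAsTable("users_test")
```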