Description
Currently, unionByName requires two DataFrames to have the same set of columns (even though the order can be different). It would be good to add either an option to unionByName or a new type of union which fills in missing columns with nulls.
val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")
df1.unionByName(df2)
This currently throws
org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);
Ideally, there would be a way to make this return a DataFrame containing:
+----+----+
|   x|   y|
+----+----+
|   1|null|
|   2|null|
|   3|null|
|null|   a|
|null|   b|
|null|   c|
+----+----+
The current workaround is to add each missing column as a null literal with withColumn before calling unionByName, but this is clunky:
df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
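The workaround above can be generalized so the missing columns do not have to be listed by hand. The following is only a sketch of that idea: the object name `UnionFillNulls` and the helper `unionByNameFillNulls` are hypothetical and not part of Spark's API, and it assumes the two DataFrames have no column names that differ only by case and that any shared columns have compatible types.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper for illustration; not part of Spark's API.
object UnionFillNulls {

  // Append to `df` every column that exists in `other` but not in `df`,
  // filled with nulls and cast to the type the column has in `other`.
  private def padMissing(df: DataFrame, other: DataFrame): DataFrame = {
    val existing = df.columns.toSet
    other.schema.fields
      .filterNot(field => existing.contains(field.name))
      .foldLeft(df) { (acc, field) =>
        acc.withColumn(field.name, lit(null).cast(field.dataType))
      }
  }

  // Pad both sides so they share the same set of columns, then union by name.
  def unionByNameFillNulls(left: DataFrame, right: DataFrame): DataFrame =
    padMissing(left, right).unionByName(padMissing(right, left))
}
```

With the df1 and df2 from the example above, `UnionFillNulls.unionByNameFillNulls(df1, df2)` should produce columns x and y with nulls in the positions each input lacks. Spark 3.1 and later expose this behaviour directly as `df1.unionByName(df2, allowMissingColumns = true)`.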
Issue Links
- is related to
  - SPARK-32798 Make unionByName optionally fill missing columns with nulls in PySpark (Resolved)
  - SPARK-32799 Make unionByName optionally fill missing columns with nulls in SparkR (Resolved)
- relates to
  - SPARK-32308 Move by-name resolution logic of unionByName from API code to analysis phase (Resolved)
- links to