[SPARK-29358] Make unionByName optionally fill missing columns with nulls - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.0
Fix Version/s: 3.1.0
Component/s: SQL
Labels: None

Description

Currently, unionByName requires two DataFrames to have the same set of columns (even though the order can be different). It would be good to add either an option to unionByName or a new type of union which fills in missing columns with nulls.

val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")
df1.unionByName(df2)

This currently throws

org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);

Ideally, there would be a way to make this return a DataFrame containing:

+----+----+
|   x|   y|
+----+----+
|   1|null|
|   2|null|
|   3|null|
|null|   a|
|null|   b|
|null|   c|
+----+----+

Currently, the workaround is to add each missing column as a null literal with withColumn before calling unionByName, which is clunky:

df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
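For reference, the Fix Version listed above (3.1.0) exposes this behaviour as a second boolean argument, allowMissingColumns, on unionByName. The following is a minimal sketch under stated assumptions rather than a definitive usage guide: the object name UnionByNameSketch, the method name unionWithNulls, and the local-mode SparkSession settings are illustrative and do not come from this ticket.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object, used only to make the sketch self-contained.
object UnionByNameSketch {

  // Builds the two DataFrames from the description and unions them by name,
  // letting Spark fill columns missing on either side with nulls.
  def unionWithNulls(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("x")
    val df2 = Seq("a", "b", "c").toDF("y")

    // Requires Spark 3.1.0 or later; without the flag this call
    // fails with an AnalysisException, as shown in the description.
    df1.unionByName(df2, allowMissingColumns = true)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded session is enough for a demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("unionByName-sketch")
      .getOrCreate()

    unionWithNulls(spark).show()
    spark.stop()
  }
}
```

Compared with the withColumn workaround, the flag avoids naming every missing column by hand, which matters most when the two schemas differ by many columns.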

Issue Links

is related to

SPARK-32798 Make unionByName optionally fill missing columns with nulls in PySpark (Resolved)

SPARK-32799 Make unionByName optionally fill missing columns with nulls in SparkR (Resolved)

relates to

SPARK-32308 Move by-name resolution logic of unionByName from API code to analysis phase (Resolved)

links to

[Github] Pull Request #28996 (viirya)

People

Assignee: L. C. Hsieh

Reporter: Mukul Murthy

Votes: 2

Watchers: 8

Dates

Created: 04/Oct/19 17:26

Updated: 12/Dec/22 18:10

Resolved: 11/Jul/20 22:59