Spark / SPARK-29358

Make unionByName optionally fill missing columns with nulls

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, unionByName requires two DataFrames to have the same set of columns (even though the order can be different). It would be good to add either an option to unionByName or a new type of union which fills in missing columns with nulls. 

      val df1 = Seq(1, 2, 3).toDF("x")
      val df2 = Seq("a", "b", "c").toDF("y")
      df1.unionByName(df2)

      This currently throws:

      org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);
      

      Ideally, there would be a way to make this return a DataFrame containing:

      +----+----+
      |   x|   y|
      +----+----+
      |   1|null|
      |   2|null|
      |   3|null|
      |null|   a|
      |null|   b|
      |null|   c|
      +----+----+
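
      Since this issue is marked Fixed in 3.1.0, the following is a minimal sketch of how the requested result can be obtained there. It assumes the two-argument overload unionByName(other, allowMissingColumns) available on Dataset in Spark 3.1.0 and later. The object name, the build method, and the local SparkSession setup are illustrative only and are not part of the issue.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameSketch {
  // Builds the two DataFrames from the description and unions them by name.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df1 = Seq(1, 2, 3).toDF("x")
    val df2 = Seq("a", "b", "c").toDF("y")
    // Assumes Spark 3.1.0+: columns missing on either side are filled with nulls
    // instead of raising an AnalysisException.
    df1.unionByName(df2, allowMissingColumns = true)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-by-name-sketch")
      .getOrCreate()
    build(spark).show()
    spark.stop()
  }
}
```

      On versions before 3.1.0 this overload is not available, so the columns have to be padded by hand.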
      

      The current workaround is to add each missing column as a null literal on both sides before calling unionByName, which is clunky:

      df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
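
      The one-liner above has to be rewritten for every pair of schemas. A hedged sketch of how the same workaround could be generalized follows. unionByNameFillNulls is a hypothetical helper, not a Spark API. It compares column names exactly (so it assumes no names differ only by case), looks only at top-level columns, and casts each null to the other side's type so the padded schemas line up.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

object UnionFillMissing {
  // Hypothetical helper (not part of Spark): pads each side with the columns
  // it lacks, as typed nulls, and then unions the two sides by name.
  def unionByNameFillNulls(left: DataFrame, right: DataFrame): DataFrame = {
    // Adds to `df` every top-level column that `other` has and `df` does not.
    def padded(df: DataFrame, other: DataFrame): DataFrame = {
      val have = df.columns.toSet
      other.schema.fields
        .filterNot(f => have.contains(f.name))
        .foldLeft(df) { (acc, f) =>
          acc.withColumn(f.name, lit(null).cast(f.dataType))
        }
    }
    padded(left, right).unionByName(padded(right, left))
  }
}
```

      With the DataFrames from the description, UnionFillMissing.unionByNameFillNulls(df1, df2) would stand in for the manual withColumn calls.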
      

            People

              Assignee: L. C. Hsieh (viirya)
              Reporter: Mukul Murthy (mukulmurthy)
              Votes: 2
              Watchers: 8