Spark / SPARK-35739

[Spark SQL] Add Java-compatible Dataset.join overloads

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 3.0.0
    • Fix Version/s: 3.4.0
    • Component/s: Java API, SQL
    • Labels: None

    Description

      Problem

      When using Spark SQL from Java, the syntax required to call the following two overloads is unnatural and not obvious to developers who haven't had to interoperate with Scala before:

      def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
      def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
      

      Examples:

      Java 11 

      Dataset<Row> dataset1 = ...;
      Dataset<Row> dataset2 = ...;
      
      // Overload with multiple usingColumns, no join type
      dataset1
        .join(dataset2, JavaConverters.asScalaBuffer(List.of("column", "column2")))
        .show();
      
      // Overload with multiple usingColumns and a join type
      dataset1
        .join(
          dataset2,
          JavaConverters.asScalaBuffer(List.of("column", "column2")),
          "left")
        .show();
      

       
      Additionally, there is no overload that takes a single usingColumn and a joinType, forcing the developer to use the Seq[String] overload regardless of language.

      Examples:

      Scala

      val dataset1: DataFrame = ...
      val dataset2: DataFrame = ...
      
      dataset1
        .join(dataset2, Seq("column"), "left")
        .show()
      

       
      Java 11

      Dataset<Row> dataset1 = ...;
      Dataset<Row> dataset2 = ...;
      
      dataset1
       .join(dataset2, JavaConverters.asScalaBuffer(List.of("column")), "left")
       .show();
      

      Proposed Improvement

      Add 3 additional overloads to Dataset:
       

      def join(right: Dataset[_], usingColumn: List[String]): DataFrame
      def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame
      def join(right: Dataset[_], usingColumn: List[String], joinType: String): DataFrame
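
      The delegation pattern behind the proposed overloads can be sketched in plain Java. This is a toy model only: JoinSketch is a hypothetical stand-in for Dataset<Row> that merely records which join was requested, it assumes List in the proposal means java.util.List, and it is not Spark's implementation. The "inner" default mirrors the documented behaviour of the existing usingColumns overload without a join type.

```java
import java.util.List;

// Hypothetical stand-in for Dataset<Row>; it only records the requested join.
final class JoinSketch {
    final String description;

    JoinSketch(String description) {
        this.description = description;
    }

    // Models: join(right, usingColumn: List[String], joinType: String)
    JoinSketch join(JoinSketch right, List<String> usingColumns, String joinType) {
        return new JoinSketch(
            description + " " + joinType + " join " + right.description
                + " using " + usingColumns);
    }

    // Models: join(right, usingColumn: List[String]); defaults to an inner join.
    JoinSketch join(JoinSketch right, List<String> usingColumns) {
        return join(right, usingColumns, "inner");
    }

    // Models: join(right, usingColumn: String, joinType: String);
    // wraps the single column and delegates to the list-based overload.
    JoinSketch join(JoinSketch right, String usingColumn, String joinType) {
        return join(right, List.of(usingColumn), joinType);
    }
}

final class JoinSketchDemo {
    public static void main(String[] args) {
        JoinSketch dataset1 = new JoinSketch("dataset1");
        JoinSketch dataset2 = new JoinSketch("dataset2");

        // No JavaConverters call is needed at any of these call sites.
        System.out.println(dataset1.join(dataset2, List.of("column", "column2")).description);
        System.out.println(dataset1.join(dataset2, "column", "left").description);
        System.out.println(dataset1.join(dataset2, List.of("column", "column2"), "left").description);
    }
}
```

      With overloads of this shape, the Java call sites from the Problem section would pass a plain java.util.List or a single String, and the Scala single-column case would no longer need a Seq wrapper.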
      

          People

            Assignee: Brandon Dahler (brandon.dahler.amazon)
            Reporter: Brandon Dahler (brandon.dahler.amazon)