Description
Problem
When using Spark SQL with Java, the syntax required to use the following two overloads is unnatural and not obvious to developers who have not had to interoperate with Scala before:
```scala
def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
```
Examples:
Java 11
```java
Dataset<Row> dataset1 = ...;
Dataset<Row> dataset2 = ...;

// Overload with multiple usingColumns, no join type
dataset1
    .join(dataset2, JavaConverters.asScalaBuffer(List.of("column", "column2")))
    .show();

// Overload with multiple usingColumns and a join type
dataset1
    .join(
        dataset2,
        JavaConverters.asScalaBuffer(List.of("column", "column2")),
        "left")
    .show();
```
Additionally, there is no overload that takes a single usingColumn and a joinType, forcing the developer to use the Seq[String] overload regardless of language.
Examples:
Scala
```scala
val dataset1: DataFrame = ...
val dataset2: DataFrame = ...

dataset1
  .join(dataset2, Seq("column"), "left")
  .show()
```
Java 11
```java
Dataset<Row> dataset1 = ...;
Dataset<Row> dataset2 = ...;

dataset1
    .join(dataset2, JavaConverters.asScalaBuffer(List.of("column")), "left")
    .show();
```
Proposed Improvement
Add three overloads to Dataset: two that accept a Java-friendly java.util.List[String] of join columns, and one that accepts a single column name together with a join type:
```scala
def join(right: Dataset[_], usingColumns: java.util.List[String]): DataFrame
def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame
def join(right: Dataset[_], usingColumns: java.util.List[String], joinType: String): DataFrame
```
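With overloads along these lines, the Java examples above would no longer need any Scala conversion helpers. The following is an illustrative sketch only, since these overloads are a proposal and do not exist on Dataset yet:

```java
// Hypothetical usage, assuming the proposed overloads are added to Dataset.
Dataset<Row> dataset1 = ...;
Dataset<Row> dataset2 = ...;

// Multiple usingColumns with a join type: plain java.util.List,
// no JavaConverters.asScalaBuffer wrapper required
dataset1
    .join(dataset2, List.of("column", "column2"), "left")
    .show();

// Single usingColumn with a join type: no Seq at all
dataset1
    .join(dataset2, "column", "left")
    .show();
```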