Description
When I try to self-join a parent DataFrame (parentDf) with one or more child DataFrames (childDf1, ...), where the child DataFrames are derived after a cube or rollup and are then filtered on the grouping columns, I get an error:

Failure when resolving conflicting references in Join:

followed by a long and quite unreadable error message. On the other hand, if I replace the cube or rollup with a plain groupBy, the same joins work without issues.
Sample code:
val numsDF = sc.parallelize(Seq(1, 2, 3, 4, 5, 6)).toDF("nums")

val cubeDF = numsDF
  .cube("nums")
  .agg(
    max(lit(0)).as("agcol"),
    grouping_id().as("gid")
  )

val group0 = cubeDF.filter(col("gid") <=> lit(0))
val group1 = cubeDF.filter(col("gid") <=> lit(1))

cubeDF.printSchema
group0.printSchema
group1.printSchema

// Recreating cubeDF
cubeDF.select("nums").distinct
  .join(group0, Seq("nums"), "inner")
  .join(group1, Seq("nums"), "inner")
  .show
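For comparison, the groupBy variant mentioned in the description does not hit the error. Note that grouping_id() is only valid together with cube/rollup/groupingSets, so in this sketch (my assumption, not part of the original report) the gid column is emulated with a literal:

```scala
// Same pipeline with groupBy instead of cube; grouping_id() cannot be
// used outside cube/rollup/groupingSets, so gid is faked with lit(0).
val groupedDF = numsDF
  .groupBy("nums")
  .agg(
    max(lit(0)).as("agcol"),
    lit(0).as("gid")
  )

val g0 = groupedDF.filter(col("gid") <=> lit(0))

// This self-join resolves and runs without the AnalysisException.
groupedDF.select("nums").distinct
  .join(g0, Seq("nums"), "inner")
  .show
```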
Sample output:
numsDF: org.apache.spark.sql.DataFrame = [nums: int]
cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more field]
group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]

root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)

root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)

root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)

org.apache.spark.sql.AnalysisException: Failure when resolving conflicting references in Join:
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
:     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
:        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
:           +- Project [nums#212, nums#212 AS nums#219]
:              +- Project [value#210 AS nums#212]
:                 +- SerializeFromObject [input[0, int, false] AS value#210]
:                    +- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
         +- Project [nums#212, nums#212 AS nums#219]
            +- Project [value#210 AS nums#212]
               +- SerializeFromObject [input[0, int, false] AS value#210]
                  +- ExternalRDD [obj#209]

Conflicting attributes: nums#220
;;
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
:     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
:        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
:           +- Project [nums#212, nums#212 AS nums#219]
:              +- Project [value#210 AS nums#212]
:                 +- SerializeFromObject [input[0, int, false] AS value#210]
:                    +- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
         +- Project [nums#212, nums#212 AS nums#219]
            +- Project [value#210 AS nums#212]
               +- SerializeFromObject [input[0, int, false] AS value#210]
                  +- ExternalRDD [obj#209]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:335)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:96)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:109)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:202)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:106)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:939)
  ... 46 elided
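A workaround sometimes used for conflicting-reference self-joins (my suggestion, not part of the original report) is to rebuild one side of the join so it gets fresh attribute ids instead of sharing cubeDF's, for example via spark.createDataFrame on the Dataset's RDD and schema:

```scala
// Hypothetical workaround: re-create group0/group1 from their RDDs so the
// resulting DataFrames carry fresh attribute ids (not cubeDF's nums#220 etc.),
// which avoids the "conflicting references" check in the join analysis.
val group0Fresh = spark.createDataFrame(group0.rdd, group0.schema)
val group1Fresh = spark.createDataFrame(group1.rdd, group1.schema)

cubeDF.select("nums").distinct
  .join(group0Fresh, Seq("nums"), "inner")
  .join(group1Fresh, Seq("nums"), "inner")
  .show
```

This forces an extra serialization round-trip, so it is a stopgap rather than a fix for the underlying analyzer behavior.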