Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.3.2
-
None
-
None
-
I'm using :
IntelliJ IDEA ==> 2018.1.4
spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
scala ==> 2.11.8
Description
When using UNION clause after a GROUP BY clause in spark, the results obtained are wrong.
The following example explicit this issue:
CREATE TABLE test_un (
col1 varchar(255),
col2 varchar(255),
col3 varchar(255),
col4 varchar(255)
);
INSERT INTO test_un (col1, col2, col3, col4)
VALUES (1,1,2,4),
(1,1,2,4),
(1,1,3,5),
(2,2,2,null);
I used the following code :
val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un") val y = x .filter(col("col4")isNotNull) .groupBy("col1", "col2","col3") .agg(count(col("col3")).alias("cnt")) .withColumn("col_name", lit("col3")) .select(col("col1"), col("col2"), col("col_name"),col("col3").alias("col_value"), col("cnt")) val z = x .filter(col("col4")isNotNull) .groupBy("col1", "col2","col4") .agg(count(col("col4")).alias("cnt")) .withColumn("col_name", lit("col4")) .select(col("col1"), col("col2"), col("col_name"),col("col4").alias("col_value"), col("cnt")) y.union(z).show()
And i obtained the following results:
col1 | col2 | col_name | col_value | cnt |
---|---|---|---|---|
1 | 1 | col3 | 5 | 1 |
1 | 1 | col3 | 4 | 2 |
1 | 1 | col4 | 5 | 1 |
1 | 1 | col4 | 4 | 2 |
Expected results:
col1 | col2 | col_name | col_value | cnt |
---|---|---|---|---|
1 | 1 | col3 | 3 | 1 |
1 | 1 | col3 | 2 | 2 |
1 | 1 | col4 | 4 | 2 |
1 | 1 | col4 | 5 | 1 |
But when i remove the last row of the table, i obtain the correct results.
(2,2,2,null)