[SPARK-27282] Spark incorrect results when using UNION with GROUP BY clause - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.2
Fix Version/s: None
Component/s: Spark Shell, Spark Submit, SQL
Labels:
None
Environment:

I'm using :

IntelliJ IDEA ==> 2018.1.4

spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)

scala ==> 2.11.8

Description

When using UNION clause after a GROUP BY clause in spark, the results obtained are wrong.

The following example explicit this issue:

CREATE TABLE test_un (
col1 varchar(255),
col2 varchar(255),
col3 varchar(255),
col4 varchar(255)
);

INSERT INTO test_un (col1, col2, col3, col4)
VALUES (1,1,2,4),
(1,1,2,4),
(1,1,3,5),
(2,2,2,null);

I used the following code :

val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")

val  y = x
   .filter(col("col4")isNotNull)
  .groupBy("col1", "col2","col3")
  .agg(count(col("col3")).alias("cnt"))
  .withColumn("col_name", lit("col3"))
  .select(col("col1"), col("col2"), col("col_name"),col("col3").alias("col_value"), col("cnt"))

val z = x
  .filter(col("col4")isNotNull)
  .groupBy("col1", "col2","col4")
  .agg(count(col("col4")).alias("cnt"))
  .withColumn("col_name", lit("col4"))
  .select(col("col1"), col("col2"), col("col_name"),col("col4").alias("col_value"), col("cnt"))

y.union(z).show()

And i obtained the following results:

col1	col2	col_name	col_value	cnt
1	1	col3	5	1
1	1	col3	4	2
1	1	col4	5	1
1	1	col4	4	2

Expected results:

col1	col2	col_name	col_value	cnt
1	1	col3	3	1
1	1	col3	2	2
1	1	col4	4	2
1	1	col4	5	1

But when i remove the last row of the table, i obtain the correct results.

(2,2,2,null)

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Sofia

Votes:: 1 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 26/Mar/19 09:28

Updated:: 12/Dec/22 18:10

Resolved:: 19/Feb/20 07:35