[SPARK-43156] Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null` - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.1
Component/s: SQL
Labels:
- correctness

Description

Example query:

spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])

In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.

We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false] - this rewrite would be correct normally, but in the context of a scalar subquery it breaks our count bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get there, the query appears to not have the count bug - it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is not subject to the count bug.

Postgres comparison show correct always-false result: http://sqlfiddle.com/#!17/67822/5

DDL for the example:

create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0);

Attachments

Issue Links

links to

[Github] Pull Request #40865 (Hisoka-X)

[Github] Pull Request #40946 (jchen5)

[Github] Pull Request #40977 (Hisoka-X)

Activity

People

Assignee:: Jack Chen

Reporter:: Jack Chen

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 16/Apr/23 20:29

Updated:: 02/May/23 15:47

Resolved:: 02/May/23 15:46