Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-35553 Improve correlated subqueries
  3. SPARK-43156

Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 3.4.1
    • SQL

    Description

      Example query:

      spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
      res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  

      In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.

      We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false] - this rewrite would be correct normally, but in the context of a scalar subquery it breaks our count bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get there, the query appears to not have the count bug - it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is not subject to the count bug.

       

      Postgres comparison show correct always-false result: http://sqlfiddle.com/#!17/67822/5

      DDL for the example:

      create or replace temp view t0 (a, b)
      as values
          (1, 1.0),
          (2, 2.0);
      create or replace temp view t1 (c, d)
      as values
          (2, 3.0); 

      Attachments

        Activity

          People

            jchen5 Jack Chen
            jchen5 Jack Chen
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: