Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-45580

Subquery changes the output schema of the outer query

    XMLWordPrintableJSON

Details

    Description

      A query can have an incorrect output schema because of a subquery.

      Assume this data:

      create or replace temp view t1(a) as values (1), (2), (3), (7);
      create or replace temp view t2(c1) as values (1), (2), (3);
      create or replace temp view t3(col1) as values (3), (9);
      cache table t1;
      cache table t2;
      cache table t3;
      

      When run in spark-sql, the following query has a superfluous boolean column:

      select *
      from t1
      where exists (
        select c1
        from t2
        where a = c1
        or a in (select col1 from t3)
      );
      
      1	false
      2	false
      3	true
      

      The result should be:

      1
      2
      3
      

      When executed via the Dataset API, you don't see the incorrect result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).

      However, even with the Dataset API, this query goes wrong:

      select (
        select *
        from t1
        where exists (
          select c1
          from t2
          where a = c1
          or a in (select col1 from t3)
        )
        limit 1
      )
      from range(1);
      
      java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
      	at scala.Predef$.assert(Predef.scala:279)
      	at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
      	at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
      	at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
      	at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
              ...
      

      Other queries that have the wrong schema:

      select *
      from t1
      where a in (
        select c1
        from t2
        where a in (select col1 from t3)
      );
      

      and

      select *
      from t1
      where not exists (
        select c1
        from t2
        where a = c1
        or a in (select col1 from t3)
      );
      

      Attachments

        Activity

          People

            bersprockets Bruce Robbins
            bersprockets Bruce Robbins
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: