Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-26722

HiveFilterSetOpTransposeRule incorrectly prunes UNION ALL operands

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 4.0.0-alpha-1
    • None
    • CBO

    Description

      Reproducer

      Consider the following query:

      set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);
      
      CREATE EXTERNAL TABLE t (a string, b string);
      INSERT INTO t VALUES ('1000', 'b1');
      INSERT INTO t VALUES ('2000', 'b2');
      
      SELECT * FROM (
        SELECT
         a,
         b
        FROM t
         UNION ALL
        SELECT
         a,
         CAST(NULL AS string)
         FROM t) AS t2
      WHERE a = 1000;EXPLAIN CBO
      SELECT * FROM (
        SELECT
         a,
         b
        FROM t
         UNION ALL
        SELECT
         a,
         CAST(NULL AS string)
         FROM t) AS t2
      WHERE a = 1000; 

      The expected result is:

      1000    b1
      1000    NULL

      An example of correct plan is as follows:

      CBO PLAN:
      HiveUnion(all=[true])
        HiveProject(a=[$0], b=[$1])
          HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
            HiveTableScan(table=[[default, t]], table:alias=[t])
        HiveProject(a=[$0], _o__c1=[null:VARCHAR(2147483647) CHARACTER SET "UTF-16LE"])
          HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
            HiveTableScan(table=[[default, t]], table:alias=[t])

       

      Consider now a scenario where expression reduction in projections is disabled by setting the following property:

      set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);
      

      In this case, the simplification of CAST(NULL) into NULL does not happen, and we get the following (invalid) result:

      1000    b1

      produced by the following invalid plan:

      CBO PLAN:
      HiveProject(a=[$0], b=[$1])
        HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
          HiveTableScan(table=[[default, t]], table:alias=[t]) 

      Problem Analysis

      At HiveFilterSetOpTransposeRule.java#L112 the RelMetadataQuery::getPulledUpPredicates method infers the following predicate due to the CAST(NULL) in the projection:

      (=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"))

      When the CAST is simplified to the NULL literal, the IS_NULL($1) predicate is inferred.

      In HiveFilterSetOpTransposeRule.java#L114-L122, the rule checks if the conjunction of the predicate coming from the filter (here =(CAST($0):DOUBLE, 1000)) and the inferred predicates is satisfiable or not, under the UnknownAsFalse semantics.

      To summarize, the following expression is simplified under the UnknownAsFalse semantics:

      AND((=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")), =(CAST($0):DOUBLE, 1000))
      

      Under In such semantics, (=($1, CAST(null:NULL):...) evaluates to FALSE, because no value is equal to NULL (even NULL itself), AND(FALSE,  =(CAST($0):DOUBLE, 1000)) necessarily evaluates to FALSE altogether, and the UNION ALL operand is pruned.

      Only by chance, when CAST(NULL) is simplified to NULL, we avoid the issue, due to the IS_NULL($1) inferred predicate, see HiveRelMdPredicates.java#L153-L156 for understanding how the NULL literal is treated differently during predicate inference.

      HiveRelMdPredicates should not use equality ('=') for nullable constant expressions, but rather IS NOT DISTINCT FROM, as detailed in HIVE-26733, but nonetheless the way simplification is done is not correct here, inferred predicates should be used as "context", rather than been used in a conjunctive expression, this usage does not conform with any of the similar uses of simplification with inferred predicates (see the bottom of the "Solution" section for examples and details).

      Solution

      In order to correctly simplify a predicate and test if it's always false or not, we should build RexSimplify with predicates as the list of predicates known to hold in the context. In this way, the different semantics are correctly taken into account.

      The code at HiveFilterSetOpTransposeRule.java#L114-L121 should be replaced by the following:

      final RexExecutor executor =
      Util.first(filterRel.getCluster().getPlanner().getExecutor(), RexUtil.EXECUTOR);
      final RexSimplify simplify = new RexSimplify(rexBuilder, predicates, executor);
      final RexNode x = simplify.simplifyUnknownAs(newCondition, RexUnknownAs.FALSE);

      This is in line with other uses of simplification, like in Calcite:

      https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L704-L705

      You can see the above method used for different kind of rels, always in the way identical to what this ticket advocates:filters:
      https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L152-L155

      projects:
      https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L306-L310

      joins:
      https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L380-L383

      windows:
      https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L567-L576

      Attachments

        Issue Links

          Activity

            People

              asolimando Alessandro Solimando
              asolimando Alessandro Solimando
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 40m
                  1h 40m