Hive / HIVE-26368

DISTINCT keyword from Count UDF is removed from query plan when CBO is enabled

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: CBO
    • Labels: None

    Description

      Reproduction steps:

      cat ql/src/test/queries/clientpositive/ctas_distinct.q

      create database test_db;
      create table test_db.test_tb
      (col1 string,
      col2 int,
      col3 int,
      col4 date);
      insert into test_db.test_tb values('a',1,2,'2022-01-01');
      insert into test_db.test_tb values('a',11,2,'2022-01-02');
      insert into test_db.test_tb values('a',1,2,'2022-01-01');
      insert into test_db.test_tb values('a',1,22,'2022-01-02');
      insert into test_db.test_tb values('a',11,2,'2022-01-01');
      set hive.log.explain.output=true;
      create table test as
      SELECT col1
      , col2
      , COUNT(DISTINCT col3, col4) AS susp_visit_count
      FROM test_db.test_tb
      GROUP BY col1
      , col2;

      Running the above test case with

      mvn install -Pitests -pl itests/qtest -Dtest=TestMiniLlapLocalCliDriver -Dqfile=ctas_distinct.q -Dtest.output.overwrite

      throws the exception below:

      2022-07-04T09:22:02,949 ERROR [76039186-5579-4a9b-b787-6d92083f1bb9 main] parse.CalcitePlanner: CBO failed, skipping CBO.
      org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
              at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:1255) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator2(SemanticAnalyzer.java:4974) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:4966) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:5651) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggrNoSkew(SemanticAnalyzer.java:6578) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11077) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12009) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11875) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:631) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12714) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:460) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:317) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:106) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:507) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:459) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:424) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:418) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:121) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:227) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255) [hive-cli-4.0.0-alpha-2-SNAPSHOT.jar:?]
              at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:200) [hive-cli-4.0.0-alpha-2-SNAPSHOT.jar:?]
              at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:126) [hive-cli-4.0.0-alpha-2-SNAPSHOT.jar:?]
              at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:421) [hive-cli-4.0.0-alpha-2-SNAPSHOT.jar:?]
              at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:352) [hive-cli-4.0.0-alpha-2-SNAPSHOT.jar:?]
              at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:727) [hive-it-util-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
              at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:697) [hive-it-util-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]

      Snippets from the log:

      2022-07-04T09:22:02,848 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] parse.CalcitePlanner: Initial CBO Plan:
      HiveProject(col1=[$0], col2=[$1], susp_visit_count=[$2])
        HiveAggregate(group=[{0, 1}], agg#0=[count(DISTINCT $2, $3)])
          HiveProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$3])
            HiveTableScan(table=[[test_db, test_tb]], table:alias=[test_tb])
      ...
      ...
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable: Stats for column col1 in table test_tb stored in cache
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable:  colName: col1 colType: string countDistincts: 1 numNulls: 0 avgColLen: 1.0 numTrues: 0 numFalses: 0 isPrimaryKey: false isEstimated: false
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable: Stats for column col2 in table test_tb stored in cache
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable:  colName: col2 colType: int countDistincts: 2 numNulls: 0 avgColLen: 4.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 11 ] isPrimaryKey: false isEstimated: false
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable: Stats for column col3 in table test_tb stored in cache
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable:  colName: col3 colType: int countDistincts: 2 numNulls: 0 avgColLen: 4.0 numTrues: 0 numFalses: 0 Range: [ min: 2 max: 22 ] isPrimaryKey: false isEstimated: false
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable: Stats for column col4 in table test_tb stored in cache
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] calcite.RelOptHiveTable:  colName: col4 colType: date countDistincts: 0 numNulls: 0 avgColLen: 56.0 numTrues: 0 numFalses: 0 Range: [ min: 18993 max: 18994 ] isPrimaryKey: false isEstimated: false
      2022-07-04T09:22:02,900 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] rules.HiveRelFieldTrimmer: Got col stats for [0, 1, 2, 3] in [test_db, test_tb]
      2022-07-04T09:22:02,924 DEBUG [76039186-5579-4a9b-b787-6d92083f1bb9 main] rules.RelFieldTrimmer: Plan after trimming unused fields
      HiveAggregate(group=[{0, 1}], agg#0=[count($2, $3)])
        HiveAggregate(group=[{0, 1, 2, 3}])
          HiveProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$3])
            HiveTableScan(table=[[test_db, test_tb]], table:alias=[test_tb])
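
Read as a query, the trimmed plan appears to be a two-level aggregation: an inner aggregate that deduplicates on all four columns, and an outer aggregate that counts per (col1, col2). A rough HiveQL equivalent is sketched below. It is an illustration only, not planner output, and because it uses COUNT(*) in the outer query it matches COUNT(DISTINCT col3, col4) only when col3 and col4 contain no NULLs, as is the case for the sample rows.

```sql
-- Illustrative rewrite only; this is not generated by Hive.
-- The inner query mirrors HiveAggregate(group=[{0, 1, 2, 3}]),
-- the outer query mirrors HiveAggregate(group=[{0, 1}], agg#0=[count(...)]).
SELECT col1
, col2
, COUNT(*) AS susp_visit_count
FROM (
  SELECT col1, col2, col3, col4
  FROM test_db.test_tb
  GROUP BY col1, col2, col3, col4
) dedup
GROUP BY col1
, col2;
```

For the five sample rows, the groups ('a', 1) and ('a', 11) each contain two distinct (col3, col4) pairs, so a count of 2 is expected for both.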
      

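      In the plan after field trimming, the count aggregate no longer carries the DISTINCT keyword, so I believe DISTINCT is being dropped from the plan when CBO is enabled. With CBO turned off, the query works as expected.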
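
The CBO-off behaviour can be checked per session. The sketch below assumes that "turning off CBO" refers to the hive.cbo.enable property; the report does not name the setting it used.

```sql
-- Assumption: CBO is disabled through hive.cbo.enable (not stated in the report).
set hive.cbo.enable=false;

create table test as
SELECT col1
, col2
, COUNT(DISTINCT col3, col4) AS susp_visit_count
FROM test_db.test_tb
GROUP BY col1
, col2;

-- Restore the setting for the rest of the session.
set hive.cbo.enable=true;
```

Disabling CBO for the whole session affects every later query, so re-enabling it after the CTAS keeps the workaround scoped to the one statement.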

       

          People

            Assignee: Unassigned
            Reporter: Taraka Rama Rao Lethavadla (tarak271)