Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-19097

related equals and in operators may cause inaccurate stats estimations

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 4.0.0-alpha-1
    • None
    • None

    Description

      tpcds#74 is optimized in a way that for date_dim the condition contains IN and = for the same column

      |             Map Operator Tree:                     |
      |                 TableScan                          |
      |                   alias: date_dim                  |
      |                   filterExpr: (((d_year) IN (2001, 2002) and (d_year = 2002) and d_date_sk is not null) or ((d_year) IN (2001, 2002) and (d_year = 2001) and d_date_sk is not null)) (type: boolean) |
      |                   Statistics: Num rows: 73049 Data size: 876588 Basic stats: COMPLETE Column stats: COMPLETE |
      |                   Filter Operator                  |
      |                     predicate: ((d_year) IN (2001, 2002) and (d_year = 2002) and d_date_sk is not null) (type: boolean) |
      |                     Statistics: Num rows: 4 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE |
      

      the "real" row count will be 365

      for separate IN and = the estimation is very good; but if both are present it becomes (very) underestimated.

      set hive.query.results.cache.enabled=false;
      
      drop table if exists t1;
      drop table if exists t8;
      
      create table t1 (a integer,b integer);
      create table t8 like t1;
      
      insert into t1 values (1,1),(2,2),(3,3),(4,4),(5,5);
      
      insert into t8
      select * from t1 union all select * from t1 union all select * from t1 union all select * from t1 union all
      select * from t1 union all select * from t1 union all select * from t1 union all select * from t1
      ;
      
      analyze table t1 compute statistics for columns;
      analyze table t8 compute statistics for columns;
      
      explain analyze select sum(a) from t8 where b in (2,3) group by b;
      explain analyze select sum(a) from t8 where b=2 group by b;
      
      explain analyze select sum(a) from t1 where b in (2,3) and b=2 group by b;
      explain analyze select sum(a) from t8 where b in (2,3) and b=2 group by b;
      
      

      Attachments

        1. HIVE-19097.01.patch
          26 kB
          Zoltan Haindrich
        2. HIVE-19097.02.patch
          7 kB
          Zoltan Haindrich
        3. HIVE-19097.03.patch
          8 kB
          Zoltan Haindrich
        4. HIVE-19097.04.patch
          8 kB
          Zoltan Haindrich
        5. HIVE-19097.05.patch
          519 kB
          Zoltan Haindrich
        6. HIVE-19097.06.patch
          475 kB
          Zoltan Haindrich
        7. HIVE-19097.06wip01.patch
          485 kB
          Zoltan Haindrich
        8. HIVE-19097.06wip02.patch
          484 kB
          Zoltan Haindrich
        9. HIVE-19097.07.patch
          475 kB
          Zoltan Haindrich
        10. HIVE-19097.08.patch
          456 kB
          Zoltan Haindrich
        11. HIVE-19097.08.patch
          456 kB
          Zoltan Haindrich
        12. HIVE-19097.09.patch
          455 kB
          Zoltan Haindrich
        13. HIVE-19097.10.patch
          476 kB
          Zoltan Haindrich
        14. HIVE-19097.11.patch
          640 kB
          Zoltan Haindrich
        15. HIVE-19097.12.patch
          708 kB
          Zoltan Haindrich
        16. HIVE-19097.13.patch
          714 kB
          Ashutosh Chauhan
        17. HIVE-19097.partial.patch
          6 kB
          Zoltan Haindrich

        Issue Links

          Activity

            People

              kgyrtkirk Zoltan Haindrich
              kgyrtkirk Zoltan Haindrich
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: