Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-9132 CBO: Calcite Operator To Hive Operator (Calcite Return Path)
  3. HIVE-10194

CBO (Calcite Return Path): Equi join followed by theta join produces a cross product

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • CBO
    • None

    Description

      Query

      SELECT count(distinct ws_order_number) as order_count,
                     sum(ws_ext_ship_cost) as total_shipping_cost,
                     sum(ws_net_profit) as total_net_profit
      FROM web_sales ws1
      JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
      JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
      JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
      LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
                                     FROM web_sales ws2 JOIN web_sales ws3
                                     ON (ws2.ws_order_number = ws3.ws_order_number)
                                     WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk
      			) ws_wh1
      ON (ws1.ws_order_number = ws_wh1.ws_order_number)
      LEFT OUTER JOIN web_returns wr1 ON (ws1.ws_order_number = wr1.wr_order_number)
      WHERE d.d_date between '1999-05-01' and '1999-07-01' and
                     ca.ca_state = 'TX' and
                     s.web_company_name = 'pri' and
                     wr1.wr_order_number is null
      limit 100
      

      Plan

      OK
      Time taken: 0.23 seconds
      Warning: Map Join MAPJOIN[83][bigTable=ws1] in task 'Map 2' is a cross product
      OK
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Map 2 <- Map 1 (BROADCAST_EDGE)
              Map 8 <- Reducer 4 (BROADCAST_EDGE)
              Reducer 3 <- Map 2 (SIMPLE_EDGE), Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE), Map 7 (SIMPLE_EDGE)
              Reducer 4 <- Map 10 (SIMPLE_EDGE), Reducer 3 (SIMPLE_EDGE)
              Reducer 9 <- Map 8 (SIMPLE_EDGE)
            DagName: mmokhtar_20150402132417_1bc8688b-59a0-4909-82a4-b9d386065bbd:3
            Vertices:
              Map 1
                  Map Operator Tree:
                      TableScan
                        alias: ws1
                        filterExpr: (((ws_ship_addr_sk = ws_order_number) and (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: boolean)
                        Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (((ws_ship_addr_sk = ws_order_number) and (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: boolean)
                          Statistics: Num rows: 71974471 Data size: 1151483592 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: ws_ship_addr_sk (type: int)
                            outputColumnNames: _col1
                            Statistics: Num rows: 71974471 Data size: 287862044 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              sort order:
                              Statistics: Num rows: 71974471 Data size: 287862044 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col1 (type: int)
                  Execution mode: vectorized
              Map 10
                  Map Operator Tree:
                      TableScan
                        alias: wr1
                        Statistics: Num rows: 13749816 Data size: 2585240312 Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: wr_order_number (type: int)
                          sort order: +
                          Map-reduce partition columns: wr_order_number (type: int)
                          Statistics: Num rows: 13749816 Data size: 2585240312 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Map 2
                  Map Operator Tree:
                      TableScan
                        alias: ws1
                        Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats: COMPLETE Column stats: COMPLETE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0
                            1
                          outputColumnNames: _col1
                          input vertices:
                            0 Map 1
                          Statistics: Num rows: 5180969438964472 Data size: 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: _col1 (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 5180969438964472 Data size: 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE
                            Group By Operator
                              keys: _col0 (type: int)
                              mode: hash
                              outputColumnNames: _col0
                              Statistics: Num rows: 73333928460636 Data size: 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE
                              Reduce Output Operator
                                key expressions: _col0 (type: int)
                                sort order: +
                                Map-reduce partition columns: _col0 (type: int)
                                Statistics: Num rows: 73333928460636 Data size: 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Map 5
                  Map Operator Tree:
                      TableScan
                        alias: ca
                        filterExpr: ((ca_state = 'TX') and ca_address_sk is not null) (type: boolean)
                        Statistics: Num rows: 800000 Data size: 811903688 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((ca_state = 'TX') and ca_address_sk is not null) (type: boolean)
                          Statistics: Num rows: 15686 Data size: 1411740 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: ca_address_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 15686 Data size: 62744 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 15686 Data size: 62744 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Map 6
                  Map Operator Tree:
                      TableScan
                        alias: d
                        filterExpr: (d_date BETWEEN '1999-05-01' AND '1999-07-01' and d_date_sk is not null) (type: boolean)
                        Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (d_date BETWEEN '1999-05-01' AND '1999-07-01' and d_date_sk is not null) (type: boolean)
                          Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: d_date_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Map 7
                  Map Operator Tree:
                      TableScan
                        alias: ws1
                        filterExpr: (((ws_order_number is not null and ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and ws_web_site_sk is not null) (type: boolean)
                        Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (((ws_order_number is not null and ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and ws_web_site_sk is not null) (type: boolean)
                          Statistics: Num rows: 143912892 Data size: 3453621540 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            key expressions: ws_order_number (type: int)
                            sort order: +
                            Map-reduce partition columns: ws_order_number (type: int)
                            Statistics: Num rows: 143912892 Data size: 3453621540 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: ws_ship_date_sk (type: int), ws_ship_addr_sk (type: int), ws_web_site_sk (type: int), ws_ext_ship_cost (type: float), ws_net_profit (type: float)
                  Execution mode: vectorized
              Map 8
                  Map Operator Tree:
                      TableScan
                        alias: s
                        filterExpr: ((web_company_name = 'pri') and web_site_sk is not null) (type: boolean)
                        Statistics: Num rows: 38 Data size: 70614 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((web_company_name = 'pri') and web_site_sk is not null) (type: boolean)
                          Statistics: Num rows: 5 Data size: 460 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: web_site_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              keys:
                                0 _col30 (type: int)
                                1 _col0 (type: int)
                              outputColumnNames: _col31, _col32, _col33
                              input vertices:
                                0 Reducer 4
                              Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                              Select Operator
                                expressions: _col31 (type: int), _col32 (type: float), _col33 (type: float)
                                outputColumnNames: _col0, _col1, _col2
                                Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                                Group By Operator
                                  aggregations: count(DISTINCT _col0), sum(_col1), sum(_col2)
                                  keys: _col0 (type: int)
                                  mode: hash
                                  outputColumnNames: _col0, _col1, _col2, _col3
                                  Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                                  Reduce Output Operator
                                    key expressions: _col0 (type: int)
                                    sort order: +
                                    Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                                    TopN Hash Memory Usage: 0.04
                                    value expressions: _col2 (type: double), _col3 (type: double)
                  Execution mode: vectorized
              Reducer 3
                  Reduce Operator Tree:
                    Merge Join Operator
                      condition map:
                           Left Semi Join 0 to 1
                      keys:
                        0 ws_order_number (type: int)
                        1 _col0 (type: int)
                      outputColumnNames: _col2, _col11, _col13, _col17, _col28, _col33
                      Statistics: Num rows: 718857633877870 Data size: 17252583213068880 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col11 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col2, _col13, _col17, _col28, _col33
                        input vertices:
                          1 Map 5
                        Statistics: Num rows: 14094999486464 Data size: 281899989729280 Basic stats: COMPLETE Column stats: COMPLETE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col2 (type: int)
                            1 _col0 (type: int)
                          outputColumnNames: _col13, _col17, _col28, _col33
                          input vertices:
                            1 Map 6
                          Statistics: Num rows: 7047403274240 Data size: 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: _col13 (type: int), _col17 (type: int), _col28 (type: float), _col33 (type: float)
                            outputColumnNames: _col2, _col3, _col4, _col5
                            Statistics: Num rows: 7047403274240 Data size: 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col3 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col3 (type: int)
                              Statistics: Num rows: 7047403274240 Data size: 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col2 (type: int), _col4 (type: float), _col5 (type: float)
              Reducer 4
                  Reduce Operator Tree:
                    Merge Join Operator
                      condition map:
                           Right Outer Join0 to 1
                      keys:
                        0 wr_order_number (type: int)
                        1 _col3 (type: int)
                      outputColumnNames: _col13, _col30, _col31, _col32, _col33
                      Statistics: Num rows: 12100482980189 Data size: 242009659603780 Basic stats: COMPLETE Column stats: COMPLETE
                      Filter Operator
                        predicate: _col13 is null (type: boolean)
                        Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col30 (type: int)
                          sort order: +
                          Map-reduce partition columns: _col30 (type: int)
                          Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                          value expressions: _col31 (type: int), _col32 (type: float), _col33 (type: float)
              Reducer 9
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: count(DISTINCT KEY._col0:0._col0), sum(VALUE._col1), sum(VALUE._col2)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                      Limit
                        Number of rows: 100
                        Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: 100
            Processor Tree:
              ListSink
      

      Logical plan

      HiveSort(fetch=[100]): rowcount = 1.0, cumulative cost = {4.594004456323317E8 rows, 5.882012320482085E9 cpu, 9.353802456E12 io}, id = 1080
        HiveAggregate(group=[{}], agg#0=[count(DISTINCT $0)], agg#1=[sum($1)], agg#2=[sum($2)]): rowcount = 1.0, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1078
          HiveProject($f0=[$3], $f1=[$4], $f2=[$5]): rowcount = 464.4485504122314, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1076
            HiveFilter(condition=[isnull($12)]): rowcount = 464.4485504122314, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1074
              HiveProject(ws_ship_date_sk=[$1], ws_ship_addr_sk=[$2], ws_web_site_sk=[$3], ws_order_number=[$4], ws_ext_ship_cost=[$5], ws_net_profit=[$6], ca_address_sk=[$7], ca_state=[$8], web_site_sk=[$11], web_company_name=[$12], d_date_sk=[$9], d_date=[$10], wr_order_number=[$0]): rowcount = 1229395.3129411766, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1185
                HiveJoin(condition=[=($3, $11)], joinType=[inner], joinAlgorithm=[map_join], cost=[{1.3749818235294119E7 rows, 1.3749820470588237E7 cpu, 0.0 io}]): rowcount = 1229395.3129411766, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1183
                  HiveJoin(condition=[=($4, $0)], joinType=[right], joinAlgorithm=[map_join], cost=[{1.3749817E7 rows, 2.7499633E7 cpu, 2.47496688E10 io}]): rowcount = 1.3749816E7, cumulative cost = {4.456506153970376E8 rows, 5.868262499011497E9 cpu, 9.353802456E12 io}, id = 1181
                    HiveTableScan(table=[[tpcds_bin_orc_200.web_returns]]): rowcount = 1.3749816E7, cumulative cost = {0}, id = 974
                    HiveJoin(condition=[=($0, $8)], joinType=[inner], joinAlgorithm=[map_join], cost=[{204.39703763146147 rows, 205.39703763146147 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {4.319007983970376E8 rows, 5.840762866011497E9 cpu, 9.3290527872E12 io}, id = 1179
                      HiveJoin(condition=[=($1, $6)], joinType=[inner], joinAlgorithm=[map_join], cost=[{1.43966865E8 rows, 1.43966866E8 cpu, 0.0 io}]): rowcount = 203.39703763146147, cumulative cost = {4.31900594E8 rows, 5.840762660614459E9 cpu, 9.3290527872E12 io}, id = 1110
                        SemiJoin(condition=[=($3, $6)], joinType=[inner]): rowcount = 1.43966864E8, cumulative cost = {2.87933729E8 rows, 5.696795794614459E9 cpu, 9.3290527872E12 io}, id = 1058
                          HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 1.43966864E8, cumulative cost = {0}, id = 958
                          HiveProject(ws_order_number=[$1]): rowcount = 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1056
                            HiveFilter(condition=[<>($0, $2)]): rowcount = 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1054
                              HiveJoin(condition=[=($1, $3)], joinType=[inner], joinAlgorithm=[common_join], cost=[{2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}]): rowcount = 1.6564734127740878E11, cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1103
                                HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 1.43966864E8, cumulative cost = {0}, id = 958
                                HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 1.43966864E8, cumulative cost = {0}, id = 958
                        HiveFilter(condition=[=($1, 'TX')]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1060
                          HiveTableScan(table=[[tpcds_bin_orc_200.customer_address]]): rowcount = 800000.0, cumulative cost = {0}, id = 965
                      HiveFilter(condition=[between(false, $1, '1999-05-01', '1999-07-01')]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1068
                        HiveTableScan(table=[[tpcds_bin_orc_200.date_dim]]): rowcount = 73049.0, cumulative cost = {0}, id = 971
                  HiveFilter(condition=[=($1, 'pri')]): rowcount = 2.235294117647059, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1064
                    HiveTableScan(table=[[tpcds_bin_orc_200.web_site]]): rowcount = 38.0, cumulative cost = {0}, id = 968
      

      Attachments

        Issue Links

          Activity

            People

              jpullokkaran Laljo John Pullokkaran
              mmokhtar Mostafa Mokhtar
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: