Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-9132 CBO: Calcite Operator To Hive Operator (Calcite Return Path)
  3. HIVE-10412

CBO : Calculate join selectivity when computing HiveJoin cost

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • CBO
    • None

    Description

      This is from TPC-DS Q7
      Because we don't compute the selectivity of sub-expression in a HiveJoin we assume that selective and non-selective joins have the similar cost.

      select  i_item_id, 
              avg(ss_quantity) agg1,
              avg(ss_list_price) agg2,
              avg(ss_coupon_amt) agg3,
              avg(ss_sales_price) agg4 
       from store_sales, customer_demographics, item
       where store_sales.ss_item_sk = item.i_item_sk and
             store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk and
             cd_gender = 'F' and 
             cd_marital_status = 'W' and
             cd_education_status = 'Primary'
       group by i_item_id
       order by i_item_id
       limit 100
      

      Cardinality

      item 462,000
      customer_demographics 1,920,800
      store_sales 82,510,879,939
      

      NDVs

      item.i_item_sk 439501
      customer_demographics.cd_demo_sk 1835839
      store_sales.ss_cdemo_sk 1835839
      

      From the logs

      2015-04-20 21:09:58,055 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(60)) - Join algorithm selection for:
      HiveJoin(condition=[=($0, $10)], joinType=[inner], algorithm=[none], cost=[not available])
        HiveJoin(condition=[=($1, $6)], joinType=[inner], algorithm=[MapJoin], cost=[{8.251089518344444E10 rows, 2.324083308641975E8 cpu, 275417.5666666666 io}])
          HiveProject(ss_item_sk=[$1], ss_cdemo_sk=[$3], ss_quantity=[$9], ss_list_price=[$11], ss_sales_price=[$12], ss_coupon_amt=[$18])
            HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.store_sales]])
          HiveProject(cd_demo_sk=[$0], cd_gender=[$1], cd_marital_status=[$2], cd_education_status=[$3])
            HiveFilter(condition=[AND(=($1, 'F'), =($2, 'W'), =($3, 'Primary'))])
              HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.customer_demographics]])
        HiveProject(i_item_sk=[$0], i_item_id=[$1])
          HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.item]])
      
      2015-04-20 21:09:58,056 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - CommonJoin cost: {6.553102534841269E8 rows, 4.0217814199458417E18 cpu, 3.499540319862703E7 io}
      2015-04-20 21:09:58,056 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - MapJoin cost: {6.553102534841269E8 rows, 2.13444462E11 cpu, 1.0720709999999998E7 io}
      2015-04-20 21:09:58,056 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(78)) - MapJoin selected
      2015-04-20 21:09:58,057 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(60)) - Join algorithm selection for:
      HiveJoin(condition=[=($1, $8)], joinType=[inner], algorithm=[none], cost=[not available])
        HiveJoin(condition=[=($0, $6)], joinType=[inner], algorithm=[MapJoin], cost=[{8.2511341939E10 rows, 2.13444462E11 cpu, 1.0720709999999998E7 io}])
          HiveProject(ss_item_sk=[$1], ss_cdemo_sk=[$3], ss_quantity=[$9], ss_list_price=[$11], ss_sales_price=[$12], ss_coupon_amt=[$18])
            HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.store_sales]])
          HiveProject(i_item_sk=[$0], i_item_id=[$1])
            HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.item]])
        HiveProject(cd_demo_sk=[$0], cd_gender=[$1], cd_marital_status=[$2], cd_education_status=[$3])
          HiveFilter(condition=[AND(=($1, 'F'), =($2, 'W'), =($3, 'Primary'))])
            HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.customer_demographics]])
      
      2015-04-20 21:09:58,058 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - CommonJoin cost: {8.251089518344444E10 rows, 2.6089279242468144E21 cpu, 4.901146588836599E9 io}
      2015-04-20 21:09:58,058 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - MapJoin cost: {8.251089518344444E10 rows, 2.324083308641975E8 cpu, 275417.5666666666 io}
      

      Attachments

        Activity

          People

            jpullokkaran Laljo John Pullokkaran
            mmokhtar Mostafa Mokhtar
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: