  1. Hive
  2. HIVE-20623

Shared work: Extend sharing of map-join cache entries in LLAP

Details

    Description

      For a query like this

      with all_sales as (
      select ss_customer_sk as customer_sk, ss_ext_list_price-ss_ext_discount_amt as ext_price from store_sales
      UNION ALL
      select ws_bill_customer_sk as customer_sk, ws_ext_list_price-ws_ext_discount_amt as ext_price from web_sales
      UNION ALL
      select cs_bill_customer_sk as customer_sk, cs_ext_sales_price - cs_ext_discount_amt as ext_price from catalog_sales)
      select sum(ext_price) total_price, c_customer_id from all_sales, customer 
      where customer_sk = c_customer_sk
      group by c_customer_id
      order by total_price desc 
      limit 100;
      

      The hashtables used by all 3 joins are identical, but they are loaded 3 times in the same LLAP instance because each one is named

          cacheKey = "HASH_MAP_" + this.getOperatorId() + "_container";
      

      in the cache.

      If those are identical in nature (i.e. same vectorization, hashtable type, etc.), then the duplication is just wasted CPU, memory and network. Using the same cache name for hashtables that will be identical in layout would be extremely useful.
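
      To illustrate the cost, here is a minimal sketch of a name-keyed cache. It is not Hive's LLAP object cache: the class HashTableCacheSketch, its methods, and the operator ids and shared key name used below are all hypothetical. Only the "HASH_MAP_" + operatorId + "_container" key format comes from the snippet above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a per-daemon cache of loaded hashtables.
class HashTableCacheSketch {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();

    // Returns the cached entry for key, running the loader only on a miss.
    Object retrieve(String key, Supplier<Object> loader) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();   // count every real load
            return loader.get();
        });
    }

    int loadCount() {
        return loads.get();
    }

    // Key format quoted in the description: one key per operator.
    static String perOperatorKey(String operatorId) {
        return "HASH_MAP_" + operatorId + "_container";
    }

    // Looks up each key once in a fresh cache and reports how many loads ran.
    static int countLoads(String... keys) {
        HashTableCacheSketch c = new HashTableCacheSketch();
        for (String key : keys) {
            c.retrieve(key, Object::new);  // Object stands in for a hashtable
        }
        return c.loadCount();
    }

    public static void main(String[] args) {
        // Three map-join operators (ids are made up), three distinct keys.
        System.out.println(countLoads(
            perOperatorKey("MAPJOIN_1"),
            perOperatorKey("MAPJOIN_2"),
            perOperatorKey("MAPJOIN_3")));   // 3
        // Same three lookups under one assumed shared key.
        String shared = "HASH_MAP_shared_customer_container";
        System.out.println(countLoads(shared, shared, shared));   // 1
    }
}
```

      With per-operator keys every join misses the cache and reloads the same data; with one shared key only the first lookup pays for the load.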

      In cases where the join is pushed through a UNION, those hashtables are identical.

      This optimization can only be done without concern for accidental delays when the same upstream task is generating all of these hashtables, which is what is achieved by the shared scan optimizer already.

      If the shared work optimization is not applied, this has potential downsides: if two customer broadcasts were sourced from "Map 1" and "Map 2", the Map 1 builder will block the other task from reading from Map 2, even though Map 2 might have started later but finished ahead of Map 1.

      So this specific optimization can always be considered for cases where the shared work unifies the operator tree and the parents of all the RS entries involved are the same (and the RS layout is the same).
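
      The condition above could be checked roughly as follows. This is only a sketch under assumptions: the BroadcastInput type, its fields, and the string form of the "layout" are hypothetical and do not reflect how Hive represents RS operators or how the attached patches implement the check.

```java
import java.util.Arrays;
import java.util.List;

class SharedKeyEligibilitySketch {

    // Hypothetical description of one broadcast feeding a map-join.
    static final class BroadcastInput {
        final String parentId;   // identity of the operator feeding the RS
        final String rsLayout;   // assumed summary of the RS key/value layout

        BroadcastInput(String parentId, String rsLayout) {
            this.parentId = parentId;
            this.rsLayout = rsLayout;
        }
    }

    // True only if every input has the same parent and the same RS layout
    // as the first one, i.e. a single upstream task produces all of the
    // hashtables in an identical shape.
    static boolean canShareCacheEntry(List<BroadcastInput> inputs) {
        if (inputs.isEmpty()) {
            return false;
        }
        BroadcastInput first = inputs.get(0);
        for (BroadcastInput in : inputs) {
            if (!in.parentId.equals(first.parentId)
                    || !in.rsLayout.equals(first.rsLayout)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Layout strings and vertex names are made up for illustration.
        String layout = "key=c_customer_sk;value=c_customer_id";
        List<BroadcastInput> unified = Arrays.asList(
            new BroadcastInput("Map 1", layout),
            new BroadcastInput("Map 1", layout),
            new BroadcastInput("Map 1", layout));
        List<BroadcastInput> split = Arrays.asList(
            new BroadcastInput("Map 1", layout),
            new BroadcastInput("Map 2", layout));
        System.out.println(canShareCacheEntry(unified));  // true
        System.out.println(canShareCacheEntry(split));    // false
    }
}
```

      The "Map 1" / "Map 2" case is rejected here, which matches the concern above about one builder blocking readers of the other broadcast.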

      Attachments

        1. hash-shared-work.json.txt
          52 kB
          Gopal Vijayaraghavan
        2. hash-shared-work.svg
          158 kB
          Gopal Vijayaraghavan
        3. HIVE-20623.01.patch
          8 kB
          jcamachorodriguez
        4. HIVE-20623.02.patch
          8 kB
          jcamachorodriguez
        5. HIVE-20623.02.patch
          8 kB
          jcamachorodriguez
        6. HIVE-20623.02.patch
          8 kB
          jcamachorodriguez
        7. HIVE-20623.03.patch
          9 kB
          jcamachorodriguez
        8. HIVE-20623.03.patch
          9 kB
          jcamachorodriguez
        9. HIVE-20623.03.patch
          9 kB
          jcamachorodriguez
        10. HIVE-20623.04.patch
          9 kB
          jcamachorodriguez
        11. HIVE-20623.04.patch
          9 kB
          jcamachorodriguez
        12. HIVE-20623.04.patch
          9 kB
          jcamachorodriguez
        13. HIVE-20623.04.patch
          9 kB
          jcamachorodriguez
        14. HIVE-20623.04.patch
          9 kB
          jcamachorodriguez
        15. HIVE-20623.patch
          8 kB
          jcamachorodriguez

            People

              jcamacho Jesús Camacho Rodríguez
              gopalv Gopal Vijayaraghavan