  1. Hive
  2. HIVE-13019

Optimizer COLLECT_LIST/COLLECT_SET


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: CBO, Logical Optimizer
    • Labels: None

    Description

      Currently, when a COLLECT_SET/COLLECT_LIST aggregates data from a single table, the aggregation is performed after any JOIN operation present in the query. For example:

      insert into table nested_customers_orders
      select c.*, collect_list(named_struct("oid", o.oid, "order_date", o.date...))
      from customers c inner join orders o on (c.cid = o.oid)
      group by o.oid, o.date,...
      

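      If the optimizer could perform the COLLECT_LIST before the JOIN (where possible), queries of this pattern should see performance gains, since the join would then handle one pre-aggregated row per group instead of every row of the aggregated table.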
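
      Until the optimizer does this automatically, the same effect can usually be obtained by hand: aggregate the single table in a subquery, then join the result. The following is a minimal sketch only. The schema is hypothetical (customers(cid, name) and orders(oid, cid, order_date), with orders.cid assumed to be the foreign key to customers), and the rewrite is equivalent to the original only if cid is unique in customers.

```sql
-- Hypothetical schema, not taken from the issue:
--   customers(cid, name)
--   orders(oid, cid, order_date)   -- orders.cid references customers.cid

INSERT INTO TABLE nested_customers_orders
SELECT c.cid, c.name, oc.orders
FROM customers c
INNER JOIN (
  -- Collapse orders to one row per customer BEFORE the join,
  -- so the join sees one row per cid rather than one per order.
  SELECT
    o.cid,
    collect_list(named_struct('oid', o.oid, 'order_date', o.order_date)) AS orders
  FROM orders o
  GROUP BY o.cid
) oc ON (c.cid = oc.cid);
```

      Because the aggregation only reads columns from orders, pushing it below the join should not change the result under the uniqueness assumption above; whether it is actually faster depends on the data and would need to be checked with EXPLAIN on a real workload.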

            People

              Assignee: Unassigned
              Reporter: Dustin Cote (cotedm)
              Votes: 1
              Watchers: 3
