Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-6188

Multi-query optimization

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      Devise extensions to SQL for multi-query optimization (MQO). Queries with multiple input tables, multiple intermediate tables, and multiple output data sets (result sets and DML operations) can be defined in one SQL statement, optimized, and executed atomically.

      There are many flavors of multi-query optimization, depending on whether each occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or "several". Our goal, here, is to allow them all to be expressed. We can then devise planning strategies that work for particular flavors.

      Examples of multi-queries:

      • Multiple DML outputs. An INSERT statement that writes into a table but also updates an index,
      • Multiple DML outputs, complex intermediate tables. A DAG that represents an ETL/ELT job;
      • Multiple query outputs. A query that produces several data sets (say a list of invoices for
        orders and a list of products that need to be restocked);
      • DAG query. A query that uses intermediate results more than once.

      See discussion in the Multi-query optimization email thread.

      Here are some SQL examples.

      We add a new keyword MULTI that represents a statement whose output contains multiple data sets and DML operations, each with a unique name. For intermediate results, we use the existing WITH clause.

      1. Multi DML

      Read from one or more tables, write to one or more tables.
      An example is inserting into a table and also an index on that table (represented as a sorted table).

      WITH
        empDelta AS (
            SELECT * FROM emp2
            EXCEPT
            SELECT * FROM emp)
      MULTI
        insertEmp AS (
          INSERT INTO emp
          TABLE empDelta),
        insertEmpDeptno AS (
          MERGE empDeptno AS e
          USING TABLE empDelta AS d
          ON e.deptno = d.deptno
          WHEN NOT MATCHED THEN INSERT VALUES (deptno));
      

      2. Query that creates temporary table and uses it more than once

      WITH
        temp AS (
          SELECT *
          FROM emp AS e
          JOIN dept USING (deptno)
          WHERE e.job = 'MANAGER'
          OR d.location = 'CHICAGO')
      SELECT deptno,
        (SELECT AVG(sal)
          FROM temp AS t
          WHERE t.deptno = e.deptno) AS deptAvgSal,
        (SELECT AVG(sal)
          FROM temp AS t
          WHERE t.job = e.job) AS jobAvgSal
      FROM e
      WHERE e.deptno IN (10, 20);
      

      3. Query whose optimal plan might use a temporary table

      This query produces the same result as the previous query. There is a common relational expression, so the optimizer should consider a DAG plan with a reified intermediate result.

      SELECT deptno,
        (SELECT AVG(e2.sal)
          FROM emp AS e2
          JOIN dept AS d USING (deptno)
          WHERE (e2.job = 'MANAGER'
            OR d.location = 'CHICAGO')
          AND e2.deptno = e.deptno) AS deptAvgSal,
        (SELECT AVG(e3.sal)
          FROM emp AS e3
          JOIN dept AS d USING (deptno)
          WHERE (e3.job = 'MANAGER'
            OR d.location = 'CHICAGO')
          AND e3.job = e.job) AS jobAvgSal
      FROM e
      WHERE e.deptno IN (10, 20);
      

      4. Query that produces several data sets

      WITH
        newOrders AS (
          SELECT *
          FROM orders
          WHERE orderDate > DATE '2023-01-25')
      MULTI
        invoices AS (
          SELECT customerName, SUM(amount)
          FROM newOrders
          GROUP BY customerName),
        restock AS (
          SELECT productId
          FROM inventory
          WHERE productId IN (
            SELECT productId FROM newOrders)
          AND itemsOnHand < 10);
      

      5. Query with a complex DAG, multiple output data sets and one DML

      WITH
        t0 AS (
          SELECT * FROM t WHERE x > 5),
        t00 AS (
          SELECT * FROM t0 WHERE y < 10),
        t000 AS (
          SELECT x, MIN(y) FROM t00 GROUP BY x),
        t1 AS (
          SELECT * FROM t WHERE y > 3),
        t10 AS (
          SELECT * FROM t1 WHERE x < 8),
        t2 AS (
          SELECT * FROM t000
          INTERSECT
          SELECT * FROM t10),
        t3 AS (
          SELECT * FROM u
          INTERSECT
          SELECT * FROM t00)
      MULTI
        q0 AS (SELECT * FROM t3),
        q1 AS (SELECT * FROM t2),
        d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1));
      

      The data flow is the following DAG:

      u --------------------+
                            +--------------> t3 (q0)
          +--> t0 --> t00 --+
          |                 +--> t000 --+
      t --+                             +--> t2 (q1)
          |         +--> t10 -----------+
          +--> t1 --+
                    +----------------------> d0
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              julianhyde Julian Hyde
              Votes:
              2 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

                Created:
                Updated: