Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24606

Multi-stage materialized CTEs can lose intermediate data

    XMLWordPrintableJSON

Details

    Description

      With complex multi-stage CTEs, Hive can start a latter stage before its previous stage finishes.
      That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve dependency between multistage materialized CTEs when a non-materialized CTE cuts in.
      https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414

       

      For example, when submitting this query,

      SET hive.optimize.cte.materialize.threshold=2;
      SET hive.optimize.cte.materialize.full.aggregate.only=false;
      
      WITH x AS ( SELECT 'x' AS id ), -- not materialized
      a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root
      a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root
      SELECT * FROM a1
      UNION ALL
      SELECT * FROM x
      UNION ALL
      SELECT * FROM a2
      UNION ALL
      SELECT * FROM a2;
      

      `toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It means the dependency between `a1` and `a2` will be ignored and `a2` can start without waiting for `a1`. As a result, the above query returns the following result.

      +-----+
      | id  |
      +-----+
      | a1  |
      | x   |
      +-----+
      

      For your information, I ran this test with revision = 425e1ff7c054f87c4db87e77d004282d529599ae.

      Attachments

        Activity

          People

            okumin Shohei Okumiya
            okumin Shohei Okumiya
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 50m
                50m