Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28156

Join plan sometimes does not use cached query

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.3.3, 2.4.3, 3.0.0
    • 2.3.4, 2.4.4, 3.0.0
    • SQL
    • None

    Description

      I came across a case where a cached query is referenced on both sides of a join, but the InMemoryRelation is inserted on only one side. This case occurs only when the cached query uses a (Hive-style) view.

      Consider this example:

      // create the data
      val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", "c", "d")
      df1.write.mode("overwrite").format("orc").saveAsTable("table1")
      sql("drop view if exists table1_vw")
      sql("create view table1_vw as select * from table1")
      
      // create the cached query
      val cacheddataDf = sql("""
      select a, b, c, d
      from table1_vw
      """)
      
      import org.apache.spark.storage.StorageLevel.DISK_ONLY
      cacheddataDf.createOrReplaceTempView("cacheddata")
      cacheddataDf.persist(DISK_ONLY)
      
      // main query
      val queryDf = sql(s"""
      select leftside.a, leftside.b
      from cacheddata leftside
      join cacheddata rightside
      on leftside.a = rightside.a
      """)
      
      queryDf.explain(true)
      

      Note that the optimized plan does not use an InMemoryRelation for the right side, but instead just uses a Relation:

      Project [a#45, b#46]
      +- Join Inner, (a#45 = a#37)
         :- Project [a#45, b#46]
         :  +- Filter isnotnull(a#45)
         :     +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 replicas)
         :           +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int,c:int,d:int>
         +- Project [a#37]
            +- Filter isnotnull(a#37)
               +- Relation[a#37,b#38,c#39,d#40] orc
      
      

      The fragment does not match the cached query because AliasViewChild adds an extra projection under the View on the right side (see #2 below).

      AliasViewChild adds the extra projection because the exprIds in the View's output appears to have been renamed by Analyzer$ResolveReferences (#1 below). I have not yet looked at why.

      -
      -
      -
         +- SubqueryAlias `rightside`
            +- SubqueryAlias `cacheddata`
               +- Project [a#73, b#74, c#75, d#76]
                  +- SubqueryAlias `default`.`table1_vw`
      (#1) ->        +- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
      (#2) ->           +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
                           +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
                              +- Project [a#37, b#38, c#39, d#40]
                                 +- SubqueryAlias `default`.`table1`
                                    +- Relation[a#37,b#38,c#39,d#40] orc
      
      

      In a larger query (where cachedata may be referred on either side only indirectly), this phenomenon can create certain oddities, as the fragment is not replaced with InMemoryRelation, and the fragment is present when the plan is optimized as a whole.

      In Spark 2.1.3, Spark uses InMemoryRelation on both sides.

      Attachments

        Activity

          People

            viirya L. C. Hsieh
            bersprockets Bruce Robbins
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: