Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25306

Avoid skewed filter trees to speed up `createFilter` in ORC

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
    • 2.4.0
    • SQL
    • None

    Description

      In both ORC data sources, createFilter function has exponential time complexity due to its skewed filter tree generation. This PR aims to improve it by using new buildTree function.

      REPRODUCE

      // Create and read 1 row table with 1000 columns
      sql("set spark.sql.orc.filterPushdown=true")
      val selectExpr = (1 to 1000).map(i => s"id c$i")
      spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc")
      print(s"With 0 filters, ")
      spark.time(spark.read.orc("/tmp/orc").count)
      
      // Increase the number of filters
      (20 to 30).foreach { width =>
        val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
        print(s"With $width filters, ")
        spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
      }
      

      RESULT

      With 0 filters, Time taken: 653 ms                                              
      With 20 filters, Time taken: 962 ms
      With 21 filters, Time taken: 1282 ms
      With 22 filters, Time taken: 1982 ms
      With 23 filters, Time taken: 3855 ms
      With 24 filters, Time taken: 6719 ms
      With 25 filters, Time taken: 12669 ms
      With 26 filters, Time taken: 25032 ms
      With 27 filters, Time taken: 49585 ms
      With 28 filters, Time taken: 98980 ms    // over 1 min 38 seconds
      With 29 filters, Time taken: 198368 ms   // over 3 mins
      With 30 filters, Time taken: 393744 ms   // over 6 mins
      

      Attachments

        Activity

          People

            dongjoon Dongjoon Hyun
            dongjoon Dongjoon Hyun
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: