SPARK-46992

Inconsistent results with 'sort', 'cache', and AQE.


Details

    Description

       
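      With AQE (Adaptive Query Execution) enabled, having a sort in the plan changes the results of sample after caching: the same DataFrame returns different rows before and after cache().

      Moreover, once the DataFrame is cached, collect still returns the records as if it were not cached, which is inconsistent with count and show.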

      A script to reproduce:

      import spark.implicits._
      val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
      
      println("NON CACHED:")
      
      println("  count: " + df.count())
      println("  collect: " + df.collect().mkString(" "))
      df.show()
      
      println("CACHED:")
      df.cache().count()
      
      println("  count: " + df.count())
      println("  collect: " + df.collect().mkString(" "))
      df.show()
      
      df.unpersist()
      

      Output:

      NON CACHED:
        count: 2
        collect: [1] [4]
      +---+
      | id|
      +---+
      |  1|
      |  4|
      +---+
      
      CACHED:
        count: 3
        collect: [1] [4]
      +---+
      | id|
      +---+
      |  1|
      |  2|
      |  3|
      +---+
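
      The mismatch in the output above can be condensed into a single pass/fail check. The following is only a sketch: the object name Spark46992Check and the local[*] master are illustrative assumptions, not part of the original report, and no particular output is claimed for any Spark version.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-alone check; the name and the local master are assumptions.
object Spark46992Check {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-46992-check")
      .getOrCreate()
    import spark.implicits._

    // Same DataFrame as in the reproduction script.
    val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

    // Rows seen before caching.
    val before = df.collect().map(_.getInt(0)).toSeq

    // Materialize the cache, then read through each action again.
    df.cache().count()
    val cachedCount = df.count()
    val cachedCollect = df.collect().map(_.getInt(0)).toSeq

    println(s"before cache:   $before")
    println(s"cached count:   $cachedCount")
    println(s"cached collect: $cachedCollect")

    // Consistent only if caching did not change the sample and
    // count agrees with the number of collected rows.
    val consistent = before == cachedCollect && cachedCount == cachedCollect.length
    println(s"consistent: $consistent")

    df.unpersist()
    spark.stop()
  }
}
```

      Comparing collected rows rather than show() output keeps the check programmatic; show() would need to be inspected by eye.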
      

      BTW, disabling AQE with spark.conf.set("spark.databricks.optimizer.adaptive.enabled", "false") helps on Databricks clusters, but locally the same setting has no effect, at least on Spark 3.3.2.
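
      For reference, open-source Spark controls AQE through spark.sql.adaptive.enabled rather than the Databricks-specific key. A minimal sketch of turning it off for a local session follows; the object name and local[*] master are assumptions for illustration, and whether this setting changes the behaviour reported in this issue is not confirmed here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session with AQE disabled via the open-source config key.
object DisableAqeLocally {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aqe-disabled")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()

    // Read the setting back to confirm what the session is actually using.
    println("spark.sql.adaptive.enabled = " + spark.conf.get("spark.sql.adaptive.enabled"))

    spark.stop()
  }
}
```

      In an existing spark-shell session the same key can be set at runtime with spark.conf.set("spark.sql.adaptive.enabled", "false") before rerunning the reproduction script.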

            People

              Assignee: Unassigned
              Reporter: Denis Tarima (dtarima)
              Votes: 0
              Watchers: 4
