[SPARK-46992] Inconsistent results with 'sort', 'cache', and AQE. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.3.2, 3.5.0
Fix Version/s: None
Component/s: Spark Core
Labels:
- correctness
- pull-request-available

Description

With AQE enabled, a sort in the plan causes sample to return different rows once the DataFrame is cached.

Moreover, once the DataFrame is cached, collect still returns the same records as the non-cached run, which is inconsistent with what count and show return.

A script to reproduce:

import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

println("NON CACHED:")

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

println("CACHED:")
df.cache().count()

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

df.unpersist()

Output:

NON CACHED:
  count: 2
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+

CACHED:
  count: 3
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
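
The inconsistency above can also be checked programmatically rather than by eye. The following is only a sketch, assuming a local SparkSession; the object name CacheConsistencyCheck and the app name are made up for illustration, and the result may differ by Spark version and configuration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the original report.
object CacheConsistencyCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; on a cluster, reuse the existing `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-46992-check")
      .getOrCreate()
    import spark.implicits._

    // Same DataFrame as in the reproduction script.
    val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

    // Materialize the cache, as the script does.
    df.cache().count()

    // Compare the two actions that disagree in the reported output.
    val counted = df.count()
    val collected = df.collect().map(_.getInt(0)).toSeq
    val consistent = counted == collected.length

    println(s"count=$counted collect=${collected.mkString(",")} consistent=$consistent")

    // Print the physical plan to see whether the cached relation is used.
    df.explain()

    df.unpersist()
    spark.stop()
  }
}
```

If count and collect read the same data, the consistent flag should be true. The output in this report corresponds to a count of 3 against two collected rows.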

Disabling AQE with spark.conf.set("spark.databricks.optimizer.adaptive.enabled", "false") avoids the problem on Databricks clusters, but it has no effect locally, at least on Spark 3.3.2.
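
For local runs, note that the key quoted above carries a Databricks prefix; open-source Spark controls AQE through spark.sql.adaptive.enabled. The following is a sketch of rerunning the reproduction with that key turned off, assuming a local SparkSession. The object and app names are hypothetical, and the report does not say whether this setting changes the outcome.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object AqeOffRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: AQE is disabled at session creation, before any plan is built or cached.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-46992-aqe-off")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Confirm the effective value of the open-source AQE switch.
    println("spark.sql.adaptive.enabled = " + spark.conf.get("spark.sql.adaptive.enabled"))

    // Same steps as the cached half of the reproduction script.
    val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
    df.cache().count()

    println("  count: " + df.count())
    println("  collect: " + df.collect().mkString(" "))
    df.show()

    df.unpersist()
    spark.stop()
  }
}
```

Setting the key when the session is built keeps an earlier cached plan from being reused under a different AQE setting.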

Issue Links

links to

GitHub Pull Request #45181

People

Assignee: Unassigned

Reporter: Denis Tarima

Votes: 0

Watchers: 4

Dates

Created: 06/Feb/24 17:10

Updated: 25/Mar/24 11:23