SPARK-38137: Repartition + shuffle + non-deterministic function leads to incorrect results


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1, 3.2.1
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      Hi,

      Using a non-deterministic function (like rand) as the partition expression in repartition leads to incorrect results. The reproducers below force the first attempt of the first two tasks to fail (by killing the executor JVM), so that Spark has to retry those tasks.

      Reproduction (correct result):

       

      import scala.sys.process._
      import org.apache.spark.TaskContext

      val res = spark.range(0, 100 * 100, 1)
        .repartition(200)
        .map { x => x }
        .repartition(200)            // repartition again, no partition expression
        .map { x =>
          // Fail the first attempt of the first two tasks by killing the executor
          // JVM, which forces Spark to retry them.
          if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
            throw new Exception("pkill -f java".!!)
          }
          x
        }
      res.distinct().count()

      The correct result is 10000.

      Reproduction (incorrect result):

       

      import scala.sys.process._
      import org.apache.spark.TaskContext
      import org.apache.spark.sql.functions.rand

      val res = spark.range(0, 100 * 100, 1)
        .repartition(200)
        .map { x => x }
        .repartition(10, rand())     // non-deterministic partition expression
        .map { x =>
          // Fail the first attempt of the first two tasks by killing the executor
          // JVM, which forces Spark to retry them.
          if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
            throw new Exception("pkill -f java".!!)
          }
          x
        }
      res.distinct().count()

      The incorrect result is 9396, i.e. some of the 10000 distinct values are lost.
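      For comparison, the sketch below runs the same failure-injection test with a deterministic partition expression (the id column produced by spark.range) instead of rand(). This is only an illustrative sketch added for clarity, not part of the original report; with a deterministic key, retried tasks recompute the same row-to-partition assignment.

      // Sketch (assumption): same test, but partitioning on the deterministic id column.
      import scala.sys.process._
      import org.apache.spark.TaskContext
      import org.apache.spark.sql.functions.col

      val resDet = spark.range(0, 100 * 100, 1)
        .repartition(200)
        .repartition(10, col("id"))  // deterministic partition expression
        .map { x =>
          // Same failure injection: kill the executor JVM on the first attempt
          // of the first two tasks so they are retried.
          if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
            throw new Exception("pkill -f java".!!)
          }
          x
        }
      resDet.distinct().count()      // expected to stay at 10000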

       

          People

            Assignee: Unassigned
            Reporter: Jakub Leś