SPARK-38137: Repartition + shuffle + non-deterministic function leads to incorrect results


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1, 3.2.1
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      Hi,

      Using a non-deterministic function (like rand) as the partition expression in repartition leads to incorrect results. The reproducers below force the first attempt of the first two tasks to fail (by killing the executor JVM), so that Spark has to retry those tasks.

      Reproduction (correct result):

       

      import scala.sys.process._
      import org.apache.spark.TaskContext

      val res = spark.range(0, 100 * 100, 1)
        .repartition(200)
        .map { x => x }
        .repartition(200)            // repartition again, no partition expression
        .map { x =>
          // Fail the first attempt of the first two tasks by killing the executor
          // JVM, which forces Spark to retry them.
          if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
            throw new Exception("pkill -f java".!!)
          }
          x
        }
      res.distinct().count()

      The correct result is 10000.

      Reproduction (incorrect result):

       

      import scala.sys.process._
      import org.apache.spark.TaskContext
      import org.apache.spark.sql.functions.rand

      val res = spark.range(0, 100 * 100, 1)
        .repartition(200)
        .map { x => x }
        .repartition(10, rand())     // non-deterministic partition expression
        .map { x =>
          // Fail the first attempt of the first two tasks by killing the executor
          // JVM, which forces Spark to retry them.
          if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
            throw new Exception("pkill -f java".!!)
          }
          x
        }
      res.distinct().count()

      The incorrect result is 9396, i.e. some of the 10000 distinct values are lost.
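      For comparison, the sketch below runs the same failure-injection test with a deterministic partition expression (the id column produced by spark.range) instead of rand(). This is only an illustrative sketch added for clarity, not part of the original report; with a deterministic key, retried tasks recompute the same row-to-partition assignment.

      // Sketch (assumption): same test, but partitioning on the deterministic id column.
      import scala.sys.process._
      import org.apache.spark.TaskContext
      import org.apache.spark.sql.functions.col

      val resDet = spark.range(0, 100 * 100, 1)
        .repartition(200)
        .repartition(10, col("id"))  // deterministic partition expression
        .map { x =>
          // Same failure injection: kill the executor JVM on the first attempt
          // of the first two tasks so they are retried.
          if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
            throw new Exception("pkill -f java".!!)
          }
          x
        }
      resDet.distinct().count()      // expected to stay at 10000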

       

          People

            Assignee: Unassigned
            Reporter: Jakub Leś