Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-47650

Spark DataFrame processed the same partition twice, which can be reproduced with very simple code.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Blocker
    • Resolution: Unresolved
    • 3.5.1
    • None
    • PySpark, SQL

    Description

      The data in Partition 0 was executed twice.

      Here is the reproduction code; the issue occurs every time.
      It appears that the string starting with "0cool" is printed twice.

       

      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder \
          .appName("llm hard negs records") \
          .master(f"local[8]") \
          .getOrCreate()
      
      
      def process_partition(index, partition):
          results = []
          s = 0
      
          for _ in partition:
              row = {"Result": "cool"}
              results.append(row)
              s += 1
      
          print(str(index) + "cool" + str(s))
          return results
      
      
      data = list(range(2000))
      
      results_rdd = spark.sparkContext.parallelize(data).repartition(8).mapPartitionsWithIndex(process_partition)
      results_df = results_rdd.toDF(["Query"])
      
      output_path = "/tmp/bc_inputs6"
      results_df.write.json(output_path, mode="overwrite")
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              adol yinan zhan
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: