Spark / SPARK-22232

Row objects in pyspark created using the `Row(**kwargs)` syntax do not get serialized/deserialized properly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      The fields in a Row object created from a dict (i.e. `Row(**kwargs)`) should be accessed by field name, not by position, because `Row.__new__` sorts the fields alphabetically by name. It seems this promise is not honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

      from pyspark.sql import Row
      from pyspark.sql.types import (
          StructType, StructField, StringType, FloatType, IntegerType)
      
      def toRow(i):
        # Keyword arguments are deliberately not in alphabetical order.
        return Row(a="a", c=3.0, b=2)
      
      schema = StructType([
        # Putting fields in alphabetical order masks the issue
        StructField("a", StringType(),  False),
        StructField("c", FloatType(), False),
        StructField("b", IntegerType(), False),
      ])
      
      # `sc` is assumed to be the SparkContext provided by the pyspark shell.
      rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
      
      # As long as we don't shuffle things work fine.
      print(rdd.toDF(schema).take(2))
      
      # If we introduce a shuffle we have issues
      print(rdd.repartition(3).toDF(schema).take(2))
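
The mismatch described above can be modeled in plain Python, without a Spark installation. This is only a sketch under stated assumptions: `sorted_row` is a hypothetical stand-in that sorts keyword fields alphabetically, as the description says `Row.__new__` does, and `schema_order` mirrors the field order of the schema in the report. It is not PySpark's actual code and does not show what happens inside the shuffle; it only contrasts reading values by name with reading them by position.

```python
# Plain-Python model of the field-order mismatch; PySpark is not imported.

def sorted_row(**kwargs):
    """Hypothetical stand-in for Row(**kwargs): fields sorted by name."""
    names = sorted(kwargs)
    values = tuple(kwargs[n] for n in names)
    return names, values

# Same field order as the schema in the report: a, c, b (not alphabetical).
schema_order = ["a", "c", "b"]

names, values = sorted_row(a="a", c=3.0, b=2)

# Access by name: look each schema field up in the row's own field list.
by_name = {f: values[names.index(f)] for f in schema_order}

# Access by position: pair the i-th value with the i-th schema field.
by_position = dict(zip(schema_order, values))

print(names)        # ['a', 'b', 'c']
print(by_name)      # {'a': 'a', 'c': 3.0, 'b': 2}
print(by_position)  # {'a': 'a', 'c': 2, 'b': 3.0}
```

In this model, positional pairing hands the integer to the `FloatType` field `c` and the float to the `IntegerType` field `b`, which is the kind of mix-up the report attributes to the shuffled case. Declaring the schema fields in alphabetical order makes both pairings agree, consistent with the comment in the reproduction that alphabetical order masks the issue.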
      

People

    • Assignee: Unassigned
    • Reporter: Bago Amirbekian (bago.amirbekian)