Description
The fields in a Row object created from a dict (i.e. Row(**kwargs)) should be accessed by field name, not by position, because Row.__new__ sorts the fields alphabetically by name. This promise does not appear to be honored when these Row objects are shuffled. I've included an example to help reproduce the issue.
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    # Putting fields in alphabetical order masks the issue
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print(rdd.toDF(schema).take(2))

# If we introduce a shuffle we have issues
print(rdd.repartition(3).toDF(schema).take(2))
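To make the ordering concern concrete without a Spark cluster, here is a minimal pure-Python sketch. `SortedRow` is a hypothetical stand-in, not PySpark's `Row`; it only imitates the alphabetical sorting of keyword fields described in this report. The sketch shows why pairing the sorted values with a non-alphabetical schema by position gives wrong results, while pairing by name does not.

```python
class SortedRow(tuple):
    """Hypothetical stand-in that sorts keyword fields by name (assumption:
    this mirrors the sorting behavior described for Row(**kwargs))."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)                      # alphabetical field order
        row = tuple.__new__(cls, [kwargs[n] for n in names])
        row.field_names = names                     # remember which value is which
        return row

    def as_dict(self):
        return dict(zip(self.field_names, self))


# Same field order as the schema in the reproduction: a, c, b (not alphabetical).
schema_names = ["a", "c", "b"]

row = SortedRow(a="a", c=3.0, b=2)

# Values are stored in alphabetical order: a, b, c.
stored = tuple(row)

# Pairing by position assigns b's value to "c" and c's value to "b".
by_position = dict(zip(schema_names, row))

# Pairing by name keeps every value with its own field.
by_name = {name: row.as_dict()[name] for name in schema_names}

print(stored)
print(by_position)
print(by_name)
```

If the shuffled code path matches values to the schema by position rather than by name, it would produce the `by_position` kind of result, which is consistent with the note that an alphabetically ordered schema masks the issue.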
Issue Links
- is duplicated by
  - SPARK-27712 createDataFrame() reorders row (Closed)
- relates to
  - SPARK-24915 Calling SparkSession.createDataFrame with schema can throw exception (Resolved)
  - SPARK-29748 Remove sorting of fields in PySpark SQL Row creation (Resolved)
  - SPARK-27939 Defining a schema with VectorUDT (Resolved)