[SPARK-22232] Row objects in pyspark created using the `Row(**kwargs)` syntax do not get serialized/deserialized properly - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: PySpark, SQL
Labels: None

Description

The fields in a Row object created from a dict (i.e. Row(**kwargs)) should be accessed by field name, not by position, because Row.__new__ sorts the fields alphabetically by name. This promise does not seem to be honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType
from pyspark.sql import Row

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  # Putting fields in alphabetical order masks the issue
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

# Assumes sc is an existing SparkContext (e.g. the one provided by the pyspark shell)
rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle, things work fine.
print(rdd.toDF(schema).take(2))

# If we introduce a shuffle, we have issues.
print(rdd.repartition(3).toDF(schema).take(2))
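
To make the name-versus-position distinction concrete without a Spark cluster, here is a minimal pure-Python sketch. It is not PySpark's implementation: `make_sorted_row`, `apply_schema_by_position`, and `apply_schema_by_name` are hypothetical helpers that only mimic the alphabetical sorting described above and show how pairing sorted values with a non-alphabetical schema by position can mislabel them, while pairing by name does not.

```python
def make_sorted_row(**kwargs):
    """Hypothetical stand-in for Row(**kwargs): fields are sorted by name."""
    names = sorted(kwargs)
    values = tuple(kwargs[name] for name in names)
    return names, values


def apply_schema_by_position(values, schema_names):
    """Pair values with schema fields purely by position."""
    return dict(zip(schema_names, values))


def apply_schema_by_name(names, values, schema_names):
    """Look each schema field up by name, ignoring the row's own order."""
    by_name = dict(zip(names, values))
    return {name: by_name[name] for name in schema_names}


# Same field order as the schema in the reproduction: a, c, b.
schema_names = ["a", "c", "b"]

names, values = make_sorted_row(a="a", c=3.0, b=2)
positional = apply_schema_by_position(values, schema_names)
by_name = apply_schema_by_name(names, values, schema_names)

print(names, values)   # the row's fields, sorted alphabetically
print(positional)      # "c" and "b" receive each other's values
print(by_name)         # each field keeps its own value
```

In real PySpark code, a comparable name-based reordering could presumably be built from `Row.asDict()` and the schema's `names` attribute before calling `toDF(schema)`. That is an assumption about a possible workaround, not something this issue proposes.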

Issue Links

is duplicated by

SPARK-27712 createDataFrame() reorders row (Closed)

relates to

SPARK-24915 Calling SparkSession.createDataFrame with schema can throw exception (Resolved)

SPARK-29748 Remove sorting of fields in PySpark SQL Row creation (Resolved)

SPARK-27939 Defining a schema with VectorUDT (Resolved)

links to

GitHub Pull Request #20280 (BryanCutler)

People

Assignee: Unassigned

Reporter: Bago Amirbekian

Votes: 0

Watchers: 6

Dates

Created: 10/Oct/17 02:50

Updated: 10/Jan/20 22:42

Resolved: 10/Jan/20 22:42