SPARK-30941

PySpark Row can be instantiated with duplicate field names


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
    • Fix Version/s: 2.4.6, 3.0.0
    • Component/s: PySpark

    Description

      Calling `collect()` on the result of a join can produce a Row whose fields share the same name. Given that the Row constructor itself rejects duplicate field names, this seems to be unintended behavior.

      This can cause correctness issues, because different ways of reading a value produce different results: `__getitem__` returns the leftmost value, while `asDict()` returns the rightmost one (the former does an index search over the field names, so the first match wins, while the latter builds a dictionary from the field/value pairs, so the last duplicate key wins).

      >>> manual_output_row = Row(a=1, b=1, b=2)
        File "<stdin>", line 1
      SyntaxError: keyword argument repeated

      >>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
      >>> df1, df2 = (spark.createDataFrame([r]) for r in input_rows)
      >>> df3 = df1.join(df2, "a")
      >>> output_row = df3.collect()[0]
      >>> output_row
      Row(a=1, b=1, b=2)
      >>> output_row["b"]
      1
      >>> output_row.asDict()["b"]
      2 
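
      For illustration, the divergence can be reproduced without Spark by mimicking the two lookup paths described above: an index search over the field names (first match wins) versus a dictionary built from the field/value pairs (last duplicate wins). The tuples and helper names below are stand-ins for this sketch, not PySpark internals.

      # Standalone sketch: mimic the two lookup paths for a Row like Row(a=1, b=1, b=2).
      fields = ("a", "b", "b")
      values = (1, 1, 2)

      def get_item(name):
          # __getitem__-style lookup: index() returns the leftmost matching field.
          return values[fields.index(name)]

      def as_dict():
          # asDict()-style lookup: building a dict keeps the rightmost duplicate.
          return dict(zip(fields, values))

      print(get_item("b"))     # 1  (leftmost "b")
      print(as_dict()["b"])    # 2  (rightmost "b" overwrites the first)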

      SPARK 1.6.3

      >>> from pyspark.sql.types import Row
      >>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
      >>> df1, df2 = (sqlContext.createDataFrame([r]) for r in input_rows)
      >>> df3 = df1.join(df2, "a")
      >>> output_row = df3.collect()[0]
      >>> output_row
      Row(a=1, b=1, b=2)
      >>> output_row["b"]
      1
      >>> output_row.asDict()["b"]
      2
      >>> sc.version
      u'1.6.3'
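
      As a possible workaround (not part of the original report), renaming the conflicting column on one side before the join keeps the collected Row's field names unique. The sketch below assumes a running SparkSession and a made-up replacement name "b_right"; it uses only standard DataFrame methods.

      # Hypothetical workaround sketch: give the right-hand "b" a distinct name
      # before joining so the collected Row has no duplicate fields.
      from pyspark.sql import Row, SparkSession

      spark = SparkSession.builder.getOrCreate()
      df1 = spark.createDataFrame([Row(a=1, b=1)])
      df2 = spark.createDataFrame([Row(a=1, b=2)]).withColumnRenamed("b", "b_right")

      row = df1.join(df2, "a").collect()[0]
      print(row)                         # Row(a=1, b=1, b_right=2)
      print(row["b"], row["b_right"])    # 1 2 -- both values are addressable
      print(row.asDict())                # {'a': 1, 'b': 1, 'b_right': 2}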
      


            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: David Roher (droher)
