SPARK-30941: PySpark Row can be instantiated with duplicate field names


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
    • Fix Version/s: 2.4.6, 3.0.0
    • Component/s: PySpark

    Description

      Calling `collect()` on the result of a join can produce a Row whose fields share the same name. Given that the Row constructor itself doesn't allow this, it appears to be unintended behavior.

      This can cause correctness issues because different ways of reading a value produce different results: `__getitem__` returns the leftmost value, while `asDict()` returns the rightmost value. The former searches the field names for the first matching index; the latter builds a dictionary, in which a later duplicate key overwrites the earlier one.
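The discrepancy can be reproduced without Spark. The following is a simplified model, not PySpark's actual source: `lookup_by_index_search` and `lookup_by_dict` are hypothetical helpers that mimic the two access strategies on a plain list of field names and a tuple of values.

```python
# Simplified model of the two lookup strategies; these helpers are
# hypothetical and are not part of PySpark.
fields = ["a", "b", "b"]   # duplicate field name, as in the joined Row
values = (1, 1, 2)


def lookup_by_index_search(name):
    # list.index returns the position of the first (leftmost) match
    return values[fields.index(name)]


def lookup_by_dict(name):
    # when building a dict, a later duplicate key overwrites the earlier one
    return dict(zip(fields, values))[name]


print(lookup_by_index_search("b"))  # 1 (leftmost)
print(lookup_by_dict("b"))          # 2 (rightmost)
```

For a field name that occurs only once, such as `"a"`, both strategies agree.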

      >>> manual_output_row = Row(a=1, b=1, b=2)
        File "<stdin>", line 1
      SyntaxError: keyword argument repeated

      >>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
      >>> df1, df2 = (spark.createDataFrame([r]) for r in input_rows)
      >>> df3 = df1.join(df2, "a")
      >>> output_row = df3.collect()[0]
      >>> output_row
      Row(a=1, b=1, b=2)
      >>> output_row["b"]
      1
      >>> output_row.asDict()["b"]
      2 

      The same behavior on Spark 1.6.3:

      >>> from pyspark.sql.types import Row
      >>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
      >>> df1, df2 = (sqlContext.createDataFrame([r]) for r in input_rows)
      >>> df3 = df1.join(df2, "a")
      >>> output_row = df3.collect()[0]
      >>> output_row
      Row(a=1, b=1, b=2)
      >>> output_row["b"]
      1
      >>> output_row.asDict()["b"]
      2
      >>> sc.version
      u'1.6.3'
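One way to guard against the ambiguity is to check the column names before collecting. This is a sketch under assumptions: `duplicate_names` is a hypothetical helper, not a PySpark API. It accepts any list of names, such as the list returned by `df3.columns`.

```python
from collections import Counter


def duplicate_names(columns):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


# For the joined DataFrame in the example, df3.columns would be ["a", "b", "b"].
print(duplicate_names(["a", "b", "b"]))  # ['b']
print(duplicate_names(["a", "b"]))       # []
```

If the result is non-empty, renaming the clashing column on one side before the join (for example with `withColumnRenamed`) avoids producing a Row with duplicate field names.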
      


          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: David Roher (droher)