SPARK-29664: Column.getItem behavior is not consistent with Scala version

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels: None

    Description

      In PySpark, Column.getItem behaves differently from its Scala counterpart.

      For example, in PySpark:

      from pyspark.sql.functions import col, create_map, lit

      df = spark.range(2)
      map_col = create_map(lit(0), lit(100), lit(1), lit(200))
      df.withColumn("mapped", map_col.getItem(col('id'))).show()
      # +---+------+
      # | id|mapped|
      # +---+------+
      # |  0|   100|
      # |  1|   200|
      # +---+------+
      

      In Scala:

      import org.apache.spark.sql.functions.{col, lit, map}

      val df = spark.range(2)
      val map_col = map(lit(0), lit(100), lit(1), lit(200))
      // The getItem call below results in the following exception, which is the right behavior:
      // java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id
      //  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
      //  at org.apache.spark.sql.Column.getItem(Column.scala:856)
      //  ... 49 elided
      df.withColumn("mapped", map_col.getItem(col("id"))).show
      
      
      // You have to use apply() to match PySpark's behavior.
      df.withColumn("mapped", map_col(col("id"))).show
      // +---+------+
      // | id|mapped|
      // +---+------+
      // |  0|   100|
      // |  1|   200|
      // +---+------+
      

      Looking at the code for the Scala implementation, PySpark's behavior is incorrect: the argument to getItem is wrapped in a `Literal`, so a Column key should be rejected rather than accepted.
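
      For reference, PySpark's getItem appears to simply delegate to Column.__getitem__ (the JVM `apply` call), which would explain the lenient behavior above. Below is a minimal sketch of the indexing form that getItem was effectively invoking; the local SparkSession setup is an assumption added to make the snippet self-contained:

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, create_map, lit

      # Assumption: a local session; in the pyspark shell, `spark` already exists.
      spark = SparkSession.builder.master("local[1]").getOrCreate()
      df = spark.range(2)
      map_col = create_map(lit(0), lit(100), lit(1), lit(200))

      # The indexing operator maps to Column.apply on the JVM side, the same
      # lookup as the Scala apply() example above, which is why
      # getItem(col("id")) behaved identically on the affected versions.
      df.withColumn("mapped", map_col[col("id")]).show()
      # +---+------+
      # | id|mapped|
      # +---+------+
      # |  0|   100|
      # |  1|   200|
      # +---+------+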


          People

            Assignee: Terry Kim (imback82)
            Reporter: Terry Kim (imback82)
