SPARK-32478

Error message to show the schema mismatch in gapply with Arrow vectorization

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: SparkR
    • Labels: None

    Description

      Currently, gapply with Arrow vectorization gives a confusing error message when the declared output schema does not match the column types of the R data.frame that the function actually returns:

      ./bin/sparkR --conf spark.sql.execution.arrow.sparkr.enabled=true
      
      df <- createDataFrame(list(list(a=1L, b="2")))
      count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))
      
        org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in stage 2.0 failed 1 times, most recent failure: Lost task 43.0 in stage 2.0 (TID 2, 192.168.35.193, executor driver): java.lang.UnsupportedOperationException
      	at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getInt(ArrowColumnVector.java:212)
      	...
      

      We should probably also document that the column types of the returned data.frame must always match the declared output schema.
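
For illustration, one way to avoid the mismatch in the reproduction above is to coerce the returned columns to the declared types before returning them. The following is only a sketch: the helper name `to_declared_types` and the local test data are hypothetical, and the SparkR call is left as a comment because it needs a running session. Declaring the schema as "a int, b string" instead would be the other obvious option.

```r
# Hypothetical helper (not part of SparkR): makes the returned data.frame
# match a declared schema of "a int, b int" by casting column b to integer.
to_declared_types <- function(key, group) {
  group$b <- as.integer(group$b)  # "2" (character) becomes 2L (integer)
  group
}

# Local check on a plain R data.frame shaped like one group from the example.
local_group <- data.frame(a = 1L, b = "2", stringsAsFactors = FALSE)
result <- to_declared_types(NULL, local_group)
stopifnot(is.integer(result$a), is.integer(result$b))

# With a SparkR session and Arrow enabled, the same function could be passed
# to gapply so that the output matches the declared schema:
# df <- createDataFrame(list(list(a = 1L, b = "2")))
# count(gapply(df, "a", to_declared_types, structType("a int, b int")))
```

The point of the local check is that the column classes can be verified in plain R before the function is handed to gapply, where a mismatch currently surfaces only as the UnsupportedOperationException shown above.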

          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Hyukjin Kwon (gurwls223)