SPARK-21727

Operating on an ArrayType in a SparkR DataFrame throws error

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SparkR
    • Labels: None

    Description

      I previously posted this as a Stack Overflow question, but it appears to be a bug.

      If I have an R data.frame in which one column is a list column (i.e., each element of the column holds an entire R vector of numbers), I can convert this data.frame to a SparkR DataFrame without error, and SparkR treats the column as ArrayType(DoubleType).

      However, any subsequent operation on this SparkR DataFrame appears to throw an error.

      Create an example R data.frame:

      indices <- 1:4
      myDf <- data.frame(indices)
      myDf$data <- list(rep(0, 20))
      

      Examine it to make sure it looks okay:

      > str(myDf) 
      'data.frame':   4 obs. of  2 variables:  
       $ indices: int  1 2 3 4  
       $ data   :List of 4
         ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
         ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
         ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
         ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
      
      > head(myDf)   
        indices                                                       data 
      1       1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
      2       2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
      3       3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
      4       4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      

      Convert it to a SparkR DataFrame:

      library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
      sparkR.session(master = "local[*]")
      mySparkDf <- as.DataFrame(myDf)
      

      Examine the SparkR DataFrame schema; notice that the list column was successfully converted to ArrayType:

      > schema(mySparkDf)
      StructType
      |-name = "indices", type = "IntegerType", nullable = TRUE
      |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
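
The DoubleType element type in this schema is consistent with R's own typing: the literal 0 is a double, so rep(0, 20) is a numeric vector, not an integer one. A minimal base-R sketch (no Spark session needed) that checks the cell types of the example data.frame; the names cellTypes, cellLengths, myDfInt and intCellTypes are hypothetical and not part of the original report:

```r
# Base R only; this does not touch SparkR.
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))   # a single list element, recycled to all 4 rows

# The literal 0 is a double, so every cell holds a double vector of length 20.
cellTypes <- vapply(myDf$data, typeof, character(1))
cellLengths <- lengths(myDf$data)
print(cellTypes)
print(cellLengths)

# Hypothetical integer variant for contrast: 0L is an integer literal.
myDfInt <- data.frame(indices)
myDfInt$data <- list(rep(0L, 20))
intCellTypes <- vapply(myDfInt$data, typeof, character(1))
print(intCellTypes)
```

The integer variant is included only to contrast R's two numeric types; how SparkR maps it, and whether it hits the same error, is not covered by this report.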
      

      However, operating on the SparkR DataFrame throws an error:

      > collect(mySparkDf)
      17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
      java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
      java.lang.Double is not a valid external type for schema of array<double>
      if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
      else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
      ... long stack trace ...
      

      Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.

          People

            neilalex Neil Alexander McQuarrie