Description
I previously posted this as a Stack Overflow question, but it appears to be a bug in SparkR.
If I have an R data.frame in which one column is a list of numeric vectors – i.e., each element of the column embeds an entire R vector of numbers – then converting this data.frame to a SparkR DataFrame appears to succeed: SparkR treats the column as ArrayType(DoubleType).
However, any subsequent operation on the resulting SparkR DataFrame throws an error.
Create an example R data.frame:
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
Examine it to confirm it looks correct:
> str(myDf)
'data.frame':   4 obs. of  2 variables:
 $ indices: int  1 2 3 4
 $ data   :List of 4
  ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> head(myDf)
  indices                                                        data
1       1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2       2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
3       3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4       4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Convert it to a SparkR DataFrame:
library(SparkR, lib.loc = paste0(Sys.getenv("SPARK_HOME"), "/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
Examine the SparkR DataFrame schema; note that the list column was converted to ArrayType:
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
However, operating on the SparkR DataFrame throws an error:
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of array<double>
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
Environment: Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.
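For anyone hitting the same error, a possible interim workaround (my own sketch, not an official fix) is to avoid the ArrayType encoding path entirely by flattening the list column to a delimited string before conversion and splitting it back after collect(). Column names follow the example above; this assumes the values contain no commas:

# Workaround sketch: round-trip the list column as StringType instead of ArrayType.
myDf$data <- sapply(myDf$data, paste, collapse = ",")   # list -> character column
mySparkDf <- as.DataFrame(myDf)                          # "data" becomes StringType
result <- collect(mySparkDf)                             # no array encoding involved
result$data <- lapply(strsplit(result$data, ","), as.numeric)  # restore numeric vectors

This obviously loses the ability to use Spark SQL array functions on the column, so it is only a stopgap until the encoder bug itself is addressed.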