[SPARK-21727] Operating on an ArrayType in a SparkR DataFrame throws error - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: SparkR
Labels: None
Target Version/s: 2.3.0

Description

I previously posted this as a Stack Overflow question, but it appears to be a bug.

If I have an R data.frame in which one column is a list column, i.e., each element of the column holds an entire vector of numbers, I can convert the data.frame to a SparkR DataFrame without any error. SparkR treats the column as ArrayType(DoubleType).

However, any subsequent operation on this SparkR DataFrame appears to throw an error.

Create an example R data.frame:

indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))

Examine it to make sure it looks okay:

> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices                                                       data 
1       1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2       2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3       3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4       4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
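
The structure shown above can also be checked programmatically before the data.frame is handed to SparkR. The following is only a sketch in base R (no SparkR calls); it rebuilds the same myDf from this report and asserts that every cell of the list column is a double vector of length 20, which is consistent with the ArrayType(DoubleType) that SparkR infers.

```r
# Rebuild the example data.frame from the report (base R only).
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))  # a length-1 list, recycled across the 4 rows

# The column is a list with one element per row.
stopifnot(is.list(myDf$data))
stopifnot(length(myDf$data) == nrow(myDf))

# Each element is a double vector of length 20, not a nested list.
stopifnot(all(vapply(myDf$data, is.double, logical(1))))
stopifnot(all(lengths(myDf$data) == 20))
```

If all assertions pass silently, the local data.frame is well formed, which suggests the failure reported below comes from the conversion or collection step rather than from the input data.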

Convert it to a SparkR DataFrame:

library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)

Examine the SparkR DataFrame schema; notice that the list column was successfully converted to ArrayType:

> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE

However, operating on the SparkR DataFrame throws an error:

> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array<double>
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.

Issue Links

links to: [GitHub] Pull Request #20352 (neilalex)

People

Assignee: Neil Alexander McQuarrie

Reporter: Neil Alexander McQuarrie

Votes: 0

Watchers: 4

Dates

Created: 14/Aug/17 20:04

Updated: 12/Dec/22 18:10

Resolved: 24/Jan/18 06:38