TOREE / TOREE-428

Can't use case class in the Scala notebook


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Workaround
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: Build
    • Labels: None

    Description

      The Docker image:
      jupyter/all-spark-notebook:latest

      How to start the container:
      docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook:latest
      or
      docker ps -a
      docker start -i containerID

      The steps:

      Visit http://localhost:8888
      Start a Toree notebook
      Input the code below

      import spark.implicits._
      val p = spark.sparkContext.textFile("../Data/person.txt")
      val pmap = p.map(_.split(","))
      pmap.collect()
      

      The output:

      res0: Array[Array[String]] = Array(Array(Barack, Obama, 53), Array(George, Bush, 68), Array(Bill, Clinton, 68))
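      Judging from that output, the input file ../Data/person.txt presumably contains three comma-separated records like these (inferred from the output above, not attached to the issue):

      Barack,Obama,53
      George,Bush,68
      Bill,Clinton,68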

      case class Persons(first_name: String, last_name: String, age: Int)
      val personRDD = pmap.map(p => Persons(p(0), p(1), p(2).toInt))
      personRDD.take(1)
      

      The error message:

      org.apache.spark.SparkDriverExecutionException: Execution error
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1186)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
        at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1354)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
        ... 39 elided
      Caused by: java.lang.ArrayStoreException: [LPersons;
        at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:2043)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:2043)
        at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1182)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
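      For context, java.lang.ArrayStoreException is the JVM error thrown when an object is stored into an array whose runtime component type cannot hold it. A plain-Scala illustration, unrelated to Spark:

      // At the bytecode level an Array[String] is also an Array[AnyRef], so the
      // cast succeeds, but storing an Integer into it fails at runtime.
      val strings: Array[String] = Array("a")
      val objects: Array[AnyRef] = strings.asInstanceOf[Array[AnyRef]]
      objects(0) = java.lang.Integer.valueOf(1) // throws java.lang.ArrayStoreException

      In the trace above, the result array's component type is [LPersons; (the REPL-compiled class), and the task results presumably carry a Persons loaded through a different classloader, so the same store check fails.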
      

      The above code works in the spark-shell. From the error message, I speculate that the driver program does not handle the case class Persons correctly when writing task results from the RDD partitions back into the driver-side result array.
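      A minimal sketch of one possible workaround (an assumption on my part, not the fix recorded for this issue): avoid materializing an Array[Persons] on the driver by going through a Dataset, whose encoder-based rows sidestep the driver-side array store. Assumes the notebook's standard spark session and that everything is defined in one cell:

      // Hypothetical workaround sketch, not the official resolution:
      // keep the case class and its uses in a single cell, and use a
      // Dataset instead of collecting an Array[Persons] on the driver.
      import spark.implicits._

      case class Persons(first_name: String, last_name: String, age: Int)

      val personDS = spark.sparkContext
        .textFile("../Data/person.txt")
        .map(_.split(","))
        .map(a => Persons(a(0), a(1), a(2).toInt))
        .toDS()        // Dataset[Persons] via the implicit product encoder

      personDS.show(1) // renders rows without building an Array[Persons]

      Defining the case class and the code that uses it in the same cell is another frequently mentioned mitigation for REPL-defined case class problems; whether either approach avoids the classloader mismatch may depend on the Toree version.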


    People

      Assignee: Unassigned
      Reporter: leeivan (Haifeng Li)
      Votes: 1
      Watchers: 7
