Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-6008

zip two rdds derived from pickleFile fails

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 1.3.0
    • 1.2.2, 1.3.0
    • PySpark
    • None

    Description

      Read an rdd from a pickle file.
      Then create another from the first, and zip them together.

      from pyspark import SparkContext
      sc = SparkContext()
      print sc.version

      r = sc.parallelize(range(1, 1000))
      r.saveAsPickleFile('file')

      rdd = sc.pickleFile('file')
      res = rdd.map(lambda row: row, preservesPartitioning=True)
      z = rdd.zip(res)
      print z.take(1)

      Gives the following error:

      File "bug.py", line 30, in <module>
      print z.take(1)
      File "/home/ubuntu/spark/python/pyspark/rdd.py", line 1225, in take
      res = self.context.runJob(self, takeUpToNumLeft, p, True)
      File "/home/ubuntu/spark/python/pyspark/context.py", line 843, in runJob
      it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
      File "/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in _call_
      File "/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 8, localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
      at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746)
      at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.foreach(RDD.scala:742)
      at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:406)
      at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:244)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613)
      at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)

      Attachments

        Issue Links

          Activity

            People

              davies Davies Liu
              cchayden Charles Hayden
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: