Spark / SPARK-10305

PySpark createDataFrame on list of LabeledPoints fails (regression)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: ML, PySpark, SQL
    • Labels: None

    Description

      The following code works in Spark 1.4 but fails in 1.5:

      import numpy as np
      from pyspark.mllib.regression import LabeledPoint
      from pyspark.mllib.linalg import Vectors
      
      lp1 = LabeledPoint(1.0, Vectors.sparse(5, np.array([0, 1]), np.array([2.0, 21.0])))
      lp2 = LabeledPoint(0.0, Vectors.sparse(5, np.array([2, 3]), np.array([2.0, 21.0])))
      tmp = [lp1, lp2]
      sqlContext.createDataFrame(tmp).show()
      

      The failure is:

      ValueError: Unexpected tuple LabeledPoint(1.0, (5,[0,1],[2.0,21.0])) with StructType
      ---------------------------------------------------------------------------
      ValueError                                Traceback (most recent call last)
      <ipython-input-1-0e7cb8772e10> in <module>()
            6 lp2 = LabeledPoint(0.0, Vectors.sparse(5, np.array([2, 3]), np.array([2.0, 21.0])))
            7 tmp = [lp1, lp2]
      ----> 8 sqlContext.createDataFrame(tmp).show()
      
      /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
          404             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
          405         else:
      --> 406             rdd, schema = self._createFromLocal(data, schema)
          407         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
          408         jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
      
      /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in _createFromLocal(self, data, schema)
          335 
          336         # convert python objects to sql data
      --> 337         data = [schema.toInternal(row) for row in data]
          338         return self._sc.parallelize(data), schema
          339 
      
      /home/ubuntu/databricks/spark/python/pyspark/sql/types.pyc in toInternal(self, obj)
          539                 return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
          540             else:
      --> 541                 raise ValueError("Unexpected tuple %r with StructType" % obj)
          542         else:
          543             if isinstance(obj, dict):
      
      ValueError: Unexpected tuple LabeledPoint(1.0, (5,[0,1],[2.0,21.0])) with StructType
      
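      Until the fix landed, one possible workaround (a sketch only; the tuple-unpacking approach is an assumption, not taken from the report) was to unpack each LabeledPoint into a plain (label, features) tuple and pass explicit column names, sidestepping the StructType check that rejects LabeledPoint objects. The stand-in class below replaces the real MLlib types so the conversion step runs without a Spark installation:

      ```python
      # Minimal stand-in for pyspark.mllib.regression.LabeledPoint
      # (hypothetical simplification for illustration; the real class holds
      # label and features the same way, but features is an MLlib vector).
      class LabeledPoint(object):
          def __init__(self, label, features):
              self.label = label
              self.features = features

      lp1 = LabeledPoint(1.0, {0: 2.0, 1: 21.0})  # dict stands in for a SparseVector
      lp2 = LabeledPoint(0.0, {2: 2.0, 3: 21.0})
      tmp = [lp1, lp2]

      # Workaround sketch: convert each LabeledPoint to a plain tuple, then
      # hand the tuples plus explicit column names to createDataFrame, e.g.
      #   sqlContext.createDataFrame(rows, ["label", "features"]).show()
      rows = [(lp.label, lp.features) for lp in tmp]
      print(rows)
      ```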


          People

            Assignee: Davies Liu (davies)
            Reporter: Joseph K. Bradley (josephkb)
            Votes: 0
            Watchers: 3
