  Zeppelin / ZEPPELIN-97

pyspark issue with mllib api


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.5.0
    • Component/s: Interpreters
    • Environment: Spark 1.4 on MapR Hadoop, running on CentOS 7.0

    Description

      The pyspark interpreter seems to have an issue accessing a Python RDD when calling the MLlib API.

      import numpy as np
      from sklearn.cross_validation import train_test_split
      from pyspark.mllib.classification import NaiveBayes
      from pyspark.mllib.linalg import Vectors
      from pyspark.mllib.regression import LabeledPoint

      # Random data: 100 samples, 3 features, integer labels in [0, 5)
      X = np.random.rand(100, 3)
      y = np.random.randint(5, size=100)

      # Hold out 20% of the samples for testing
      trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.2)

      # sc is the SparkContext provided by the pyspark interpreter
      training = sc.parallelize([LabeledPoint(ylabel, Vectors.dense(xrow)) for (xrow, ylabel) in zip(trainX, trainy)])
      testing = sc.parallelize([LabeledPoint(ylabel, Vectors.dense(xrow)) for (xrow, ylabel) in zip(testX, testy)])

      # This is the call that raises the AttributeError in Zeppelin
      model = NaiveBayes.train(training, 0.1)
      

      The code above fails at the last line (the NaiveBayes.train call).

      Error:

      (<type 'exceptions.AttributeError'>, AttributeError("'list' object has no attribute '_get_object_id'",), <traceback object at 0x392b638>)
      

      The same code runs fine from the pyspark shell. Other features, such as DataFrames, were also tested from the Zeppelin pyspark interpreter and seem to work fine.
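
The error message suggests that a plain Python list reached a code path that expected a proxy for a JVM object, since only such proxies carry a _get_object_id method. One plausible reading, which this report does not confirm, is that the interpreter's Python-to-JVM bridge is not converting Python collections automatically, while the one used by the pyspark shell is. The sketch below is a simplified, hypothetical model of such an argument encoder; it is not Py4J's actual code, and JvmRef, encode_argument, and describe_failure are illustrative names only.

```python
# Simplified, hypothetical model of a Python-to-JVM argument encoder.
# This is NOT Py4J's real implementation; all names are illustrative.


class JvmRef(object):
    """Stand-in for a proxy to an object that lives in the JVM."""

    def __init__(self, object_id):
        self._object_id = object_id

    def _get_object_id(self):
        return self._object_id


def encode_argument(arg, auto_convert=False):
    """Encode one call argument as a string command part."""
    if isinstance(arg, bool):  # checked before int: bool is a subclass of int
        return "b" + str(arg)
    if isinstance(arg, int):
        return "i" + str(arg)
    if isinstance(arg, float):
        return "d" + repr(arg)
    if isinstance(arg, str):
        return "s" + arg
    if auto_convert and isinstance(arg, (list, tuple)):
        # With conversion enabled, a Python sequence is encoded element by element.
        return "l[" + ",".join(encode_argument(a, auto_convert) for a in arg) + "]"
    # Anything else is assumed to be a JVM proxy; a plain list fails here.
    return "r" + arg._get_object_id()


def describe_failure(arg):
    """Return the AttributeError message raised without conversion, or None."""
    try:
        encode_argument(arg)
    except AttributeError as exc:
        return str(exc)
    return None
```

In this model, describe_failure([1.0, 2.0]) yields the same message as the error above, while encode_argument([1, 2], auto_convert=True) succeeds. If the real cause is similar, comparing how the Zeppelin interpreter and the pyspark shell set up their gateway would be a reasonable first check.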


          People

            Assignee: Lee Moon Soo (moon)
            Reporter: Bobby Chowdary (bobbych03)