Spark / SPARK-27335

cannot collect() from Correlation.corr

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      The bug reproduces with the example from the documentation:

      import pyspark
      from pyspark.ml.linalg import Vectors
      from pyspark.ml.stat import Correlation
      spark = pyspark.sql.SparkSession.builder.getOrCreate()
      dataset = [[Vectors.dense([1, 0, 0, -2])],
       [Vectors.dense([4, 5, 0, 3])],
       [Vectors.dense([6, 7, 0, 8])],
       [Vectors.dense([9, 0, 0, 1])]]
      dataset = spark.createDataFrame(dataset, ['features'])
      df = Correlation.corr(dataset, 'features', 'pearson')
      df.collect()
       
      

      This produces the following stack trace:

       

      ---------------------------------------------------------------------------
      AttributeError                            Traceback (most recent call last)
      <ipython-input-92-e7889fa5d198> in <module>()
           11 dataset = spark.createDataFrame(dataset, ['features'])
           12 df = Correlation.corr(dataset, 'features', 'pearson')
      ---> 13 df.collect()
      
      /opt/spark/python/pyspark/sql/dataframe.py in collect(self)
          530         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
          531         """
      --> 532         with SCCallSiteSync(self._sc) as css:
          533             sock_info = self._jdf.collectToPython()
          534         return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
      
      /opt/spark/python/pyspark/traceback_utils.py in __enter__(self)
           70     def __enter__(self):
           71         if SCCallSiteSync._spark_stack_depth == 0:
      ---> 72             self._context._jsc.setCallSite(self._call_site)
           73         SCCallSiteSync._spark_stack_depth += 1
           74 
      
      AttributeError: 'NoneType' object has no attribute 'setCallSite'

       

       

      Analysis:

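      Somehow the DataFrame properties `df.sql_ctx.sparkSession._jsparkSession` and `df._sc` do not match the corresponding ones on the active Spark session (`spark._jsparkSession` and `spark._sc`). The traceback is consistent with this: `collect()` fails because `df._sc._jsc` is `None`.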
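
To check the mismatch described above before touching anything, a small diagnostic along these lines may help. It is only a sketch: `compare_contexts` is a hypothetical helper, not a PySpark API, and it reads the private attributes named in this report (`_sc`, `_jsc`, `sql_ctx.sparkSession._jsparkSession`), which are internals of PySpark 2.4.0 and may differ in other versions. The stand-in objects at the bottom are plain `SimpleNamespace` stubs so the snippet runs without a Spark installation; in a real session you would pass `df` and `spark` instead.

```python
from types import SimpleNamespace


def compare_contexts(df, spark):
    """Compare the private handles of a DataFrame with those of a session.

    Hypothetical helper: it only reads attributes named in this report.
    """
    df_sc = getattr(df, "_sc", None)
    df_session = getattr(getattr(df, "sql_ctx", None), "sparkSession", None)
    return {
        # True when df._sc._jsc is None, the condition behind the
        # AttributeError in the stack trace
        "df_jsc_is_none": getattr(df_sc, "_jsc", None) is None,
        # True when the DataFrame and the session share one SparkContext object
        "same_spark_context": df_sc is getattr(spark, "_sc", None),
        # True when both point at the same JVM-side session handle
        "same_jvm_session": getattr(df_session, "_jsparkSession", None)
        is getattr(spark, "_jsparkSession", None),
    }


# Stand-ins (assumptions, not real PySpark objects) mimicking the reported state:
# the session is live, the DataFrame holds a context whose _jsc is None.
spark_stub = SimpleNamespace(_sc=SimpleNamespace(_jsc=object()), _jsparkSession=object())
df_stub = SimpleNamespace(
    _sc=SimpleNamespace(_jsc=None),
    sql_ctx=SimpleNamespace(sparkSession=SimpleNamespace(_jsparkSession=object())),
)

report = compare_contexts(df_stub, spark_stub)
print(report)
```

If `df_jsc_is_none` is true while the session's own context is live, the DataFrame is bound to a different context than the one `spark` uses, which matches the analysis above.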

      The following code works around the problem (I hope it helps narrow down the root cause):

       

      df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
      df._sc = spark._sc
      
      df.collect()
      
      >>> [Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], False))]
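
For repeated use in a notebook, the two assignments above could be wrapped in a helper so the patching of private attributes stays in one place. This is a sketch of the same workaround, not a fix for the root cause: `rebind_to_session` is a hypothetical name, and the stubs below are `SimpleNamespace` stand-ins (not PySpark objects) used only so the snippet runs on its own.

```python
from types import SimpleNamespace


def rebind_to_session(df, spark):
    """Point a DataFrame's private handles at the given session.

    Hypothetical helper that applies the workaround from this report as-is.
    It mutates PySpark internals, so treat it as a debugging aid only.
    """
    df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
    df._sc = spark._sc
    return df


# Stand-ins (assumptions) for a live session and a DataFrame with stale handles.
spark_stub = SimpleNamespace(_sc=SimpleNamespace(_jsc=object()), _jsparkSession=object())
df_stub = SimpleNamespace(
    _sc=SimpleNamespace(_jsc=None),
    sql_ctx=SimpleNamespace(sparkSession=SimpleNamespace(_jsparkSession=None)),
)

rebind_to_session(df_stub, spark_stub)
print(df_stub._sc is spark_stub._sc)
```

With real objects the call would be `rebind_to_session(df, spark).collect()`.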

       

          People

            Assignee: Unassigned
            Reporter: Natalino Busa (natalinobusa)