Spark / SPARK-42905

pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.


Details

    Description

      pyspark.ml.stat.Correlation

Following is the scenario in which the Correlation function fails to produce correct Spearman coefficient results.

Tested example: a Spark DataFrame with 2 columns, A and B, each with a total of 108 million rows.

Column A has 3 distinct values; column B has 4 distinct values.

If I calculate the correlation for this DataFrame with pandas DF.corr, it gives the correct answer, and running the same code multiple times produces the same answer (each column has only 3-4 distinct values).
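For reference, the deterministic pandas behaviour can be reproduced on a small scale (a sketch: the column names and the tie-heavy synthetic values below are illustrative, not the actual 108M-row data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Tie-heavy data: only 3 distinct values in A, 4 in B, over 100k rows.
df = pd.DataFrame({
    "A": rng.integers(0, 3, size=100_000),
    "B": rng.integers(0, 4, size=100_000),
})

# pandas assigns average ranks to ties, so repeated runs agree exactly.
r1 = df.corr(method="spearman").loc["A", "B"]
r2 = df.corr(method="spearman").loc["A", "B"]
print(r1, r1 == r2)
```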

       

Coming to Spark, Spearman correlation produces different results for the same DataFrame on multiple runs (see attached screenshots), even though each column in this DataFrame has only 3-4 distinct values.

       

Basically, pandas Df.corr gives the same result on the same DataFrame across multiple runs, which is the expected behaviour. Spark, on the same data, not only gives a different result from pandas, but also produces different results when the same cell is run multiple times, so the output is inconsistent.

Coming to the data, the only observation I could conclude is the ties (only 3-4 distinct values over 108M rows). This scenario does not appear to be handled by Spark's correlation method, since the same data produces consistent results in Python with df.corr.

The only workaround we could find that gives consistent output matching the Python result is a pandas UDF (see attached screenshots).

       

We also tried the pyspark.pandas.DataFrame.corr method, and it produces incorrect and inconsistent results for this case too.

Only the pandas UDF approach seems to provide consistent results.

       

Another point to note: if I add some random noise to the data, which in turn increases the number of distinct values, the results become consistent again across any number of runs. This makes me believe that the Python version handles ties correctly and gives consistent results no matter how many ties exist, whereas the PySpark method is somehow not able to handle data with many ties.
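The noise observation can be illustrated in plain NumPy/pandas (small illustrative data again): adding tiny random noise makes nearly every value distinct, which removes the ties.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# 10k draws from only 3 distinct values: heavy ties.
a = rng.integers(0, 3, size=10_000).astype(float)
print(pd.Series(a).nunique())   # 3 distinct values

# Tiny continuous noise makes almost every value unique, removing the ties.
noisy = a + rng.normal(0, 1e-6, size=a.size)
print(pd.Series(noisy).nunique())
```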

      Attachments

        1. image-2023-03-23-10-55-26-879.png
          36 kB
          dronzer
        2. image-2023-03-23-10-53-37-461.png
          71 kB
          dronzer
        3. image-2023-03-23-10-52-49-392.png
          113 kB
          dronzer
        4. image-2023-03-23-10-52-11-481.png
          22 kB
          dronzer
        5. image-2023-03-23-10-51-28-420.png
          83 kB
          dronzer

People

  Assignee: Unassigned
  Reporter: dronzer