Spark / SPARK-42905

pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.


Details

    Description

      pyspark.ml.stat.Correlation

Following is the scenario in which the Correlation function fails to produce correct Spearman coefficient results.

Tested example: a Spark DataFrame with 2 columns, A and B, each with a total of 108 million rows.

Column A has 3 distinct values; column B has 4 distinct values.

If I calculate the correlation for this DataFrame with pandas DF.corr, it gives the correct answer, and running the same code multiple times produces the same answer (each column has only 3-4 distinct values).
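For reference, the deterministic pandas behaviour can be reproduced on a small scale (a sketch: the column names and the tie-heavy synthetic values below are illustrative, not the actual 108M-row data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Tie-heavy data: only 3 distinct values in A, 4 in B, over 100k rows.
df = pd.DataFrame({
    "A": rng.integers(0, 3, size=100_000),
    "B": rng.integers(0, 4, size=100_000),
})

# pandas assigns average ranks to ties, so repeated runs agree exactly.
r1 = df.corr(method="spearman").loc["A", "B"]
r2 = df.corr(method="spearman").loc["A", "B"]
print(r1, r1 == r2)
```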

       

Coming to Spark, Spearman correlation produces different results for the same DataFrame on multiple runs (see attached screenshots), even though each column in this DataFrame has only 3-4 distinct values.

       

Basically, pandas Df.corr gives the same result on the same DataFrame across multiple runs, which is the expected behaviour. Spark, on the same data, not only gives a different result from pandas, but also produces different results when the same cell is run multiple times, so the output is inconsistent.

Coming to the data, the only observation I could conclude is the ties (only 3-4 distinct values over 108M rows). This scenario does not appear to be handled by Spark's correlation method, since the same data produces consistent results in Python with df.corr.

The only workaround we could find that gives consistent output matching the Python result is a pandas UDF (see attached screenshots).

       

We also tried the pyspark.pandas.DataFrame.corr method, and it produces incorrect and inconsistent results for this case too.

Only the pandas UDF approach seems to provide consistent results.

       

Another point to note: if I add some random noise to the data, which in turn increases the number of distinct values, the results become consistent again across any number of runs. This makes me believe that the Python version handles ties correctly and gives consistent results no matter how many ties exist, whereas the PySpark method is somehow not able to handle data with many ties.
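The noise observation can be illustrated in plain NumPy/pandas (small illustrative data again): adding tiny random noise makes nearly every value distinct, which removes the ties.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# 10k draws from only 3 distinct values: heavy ties.
a = rng.integers(0, 3, size=10_000).astype(float)
print(pd.Series(a).nunique())   # 3 distinct values

# Tiny continuous noise makes almost every value unique, removing the ties.
noisy = a + rng.normal(0, 1e-6, size=a.size)
print(pd.Series(noisy).nunique())
```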

      Attachments

        1. image-2023-03-23-10-55-26-879.png
          36 kB
          dronzer
        2. image-2023-03-23-10-53-37-461.png
          71 kB
          dronzer
        3. image-2023-03-23-10-52-49-392.png
          113 kB
          dronzer
        4. image-2023-03-23-10-52-11-481.png
          22 kB
          dronzer
        5. image-2023-03-23-10-51-28-420.png
          83 kB
          dronzer

People

  Assignee: Unassigned
  Reporter: dronzer