  1. Spark
  2. SPARK-26158

Enhance the accuracy of covariance in RowMatrix for DenseVector


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: MLlib
    • Labels:
      None

      Description

      Comparing Spark's computeCovariance function in RowMatrix (for DenseVector input) with NumPy's cov function reveals two problems. The results are shown below.

      1) The Spark function computeCovariance in RowMatrix is not accurate.

      Input data:

      1.0,2.0,3.0,4.0,5.0
      2.0,3.0,1.0,2.0,6.0

      NumPy function cov result:

      [[ 2.5   1.75]
       [ 1.75  3.7 ]]

      RowMatrix function computeCovariance result:

      2.5   1.75
      1.75  3.700000000000001
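As a cross-check of the reference values above, here is a minimal sketch of a two-pass sample covariance (divide by n - 1) in plain Python. The function name sample_covariance is hypothetical and this is not Spark's or NumPy's implementation; it only shows that centering the data first gives values of about 2.5, 1.75 and 3.7 for this input.

```python
import math

def sample_covariance(variables):
    """Two-pass sample covariance; one inner list per variable (hypothetical helper)."""
    n = len(variables[0])
    # Pass 1: compute the mean of each variable.
    means = [math.fsum(v) / n for v in variables]
    # Pass 2: center each variable, then average products of deviations.
    centered = [[x - m for x in v] for v, m in zip(variables, means)]
    return [
        [math.fsum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]

# The input data from the report, one list per variable.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 3.0, 1.0, 2.0, 6.0],
]
cov = sample_covariance(data)
for row in cov:
    print(row)
```

The difference between 3.7 and 3.700000000000001 reported above is on the order of one unit in the last place of a double, so this first case is a rounding-level discrepancy rather than a large error.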

       

      2) For some inputs, the result is badly wrong.

      Input data generated with the following logic:

      import numpy as np
      data1 = np.random.normal(loc=100000, scale=0.000009, size=10000000)
      data2 = np.random.normal(loc=200000, scale=0.000002, size=10000000)

       

      NumPy function cov result:

      [[ 8.10536442e-11  -4.35439574e-15]
       [-4.35439574e-15   3.99928264e-12]]

       

      RowMatrix function computeCovariance result:

      -0.0027484893798828125  0.001491546630859375
      0.001491546630859375    8.087158203125E-4
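The report does not state the cause of this discrepancy. One common source of errors of this kind, assumed here for illustration, is catastrophic cancellation: when the covariance is formed from uncentered sums of products and the mean is subtracted afterwards, two nearly equal numbers around 1e10 are subtracted to recover a value around 1e-10. The sketch below compares such a one-pass formula with a two-pass formula that centers the data first. The function names are hypothetical, the sample size is reduced to 10000 to keep it fast, and plain Python is used instead of NumPy to stay dependency-free.

```python
import math
import random

def variance_one_pass(xs):
    """Uncentered formula: sum(x^2)/(n-1) - n/(n-1) * mean^2 (prone to cancellation)."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum(x * x for x in xs)
    return sum_sq / (n - 1) - n / (n - 1) * mean * mean

def variance_two_pass(xs):
    """Centered formula: subtract the mean first, then sum squared deviations."""
    n = len(xs)
    mean = math.fsum(xs) / n
    return math.fsum((x - mean) ** 2 for x in xs) / (n - 1)

random.seed(0)  # fixed seed so the run is repeatable
# Same distribution as data1 in the report, with a smaller sample size.
xs = [random.gauss(100000, 0.000009) for _ in range(10000)]

naive = variance_one_pass(xs)
centered = variance_two_pass(xs)
print("one-pass:", naive)
print("two-pass:", centered)
```

The true variance is 0.000009 squared, about 8.1e-11. The two-pass result should land close to that, while the one-pass result is limited by the spacing of doubles near 1e10 (roughly 2e-6) and so cannot represent the answer at all. This matches the pattern in the RowMatrix output above, including the negative value on the diagonal.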

            People

            • Assignee:
              KyleLi1985 Liang Li
            • Reporter:
              KyleLi1985 Liang Li
