[SPARK-15346] Reduce duplicate computation in picking initial points in LocalKMeans - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: None
Labels:
- performance
Environment:

Ubuntu 14.04

Target Version/s:

2.0.0

Description

Main Issue

I found that for KMans|| in mllib, when dataset is in large scale, after initial KMeans|| finishes and before Lloyd's iteration begins, the program will stuck for a long time without terminal. After testing I see it's stucked with LocalKMeans. And there is a to be improved feature in LocalKMeans.scala in Mllib. After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old centers as below

val costArray = points.map { point =>
      KMeans.fastSquaredDistance(point, centers(0))
    }

Instead this we can keep the distance between all the points and their closest centers, and compare to the distance of them with the new center then update them.

Test

Download LocalKMeans.zip
I provided a attach "LocalKMeans.zip" which contains the code "LocalKMeans2.scala" and dataset "bigKMeansMedia"
LocalKMeans2.scala contains both original version method KMeansPlusPlus and a modified version KMeansPlusPlusModify. (best fit with spark.mllib-1.6.0)
I added a tests and main function in it so that any one can run the file directly.

How to Test

Replacing mllib.clustering.LocalKMeans.scala in your local repository with my LocalKMeans2.scala or just put them in the same dir.
Modify the path in line 34 (loadAndRun()) with the path you restoring the data file bigKMeansMedia which is also provided in the patch.
Tune the 2nd and 3rd parameter in line 34 (loadAndRun()) which are refereed to clustering number K and iteration number respectively.
Then the console will print the cost time and SE of the two version of KMeans++ respectively.

Test Results

This data is generated from a KMeans|| eperiment in spark, I add some inner function and output the result of KMeans|| initialization and restore.
The first line of the file with format "%d:%d:%d:%d" indicates "the seed:feature num:iteration num (in original KMeans||):points num" of the data.

In my machine the experiment result is as below:

the x-axis is the clustering num k while y-axis is the time in seconds

Attachments

Issue Links

links to

[Github] Pull Request #13133 (mouendless)

Activity

People

Assignee:: Abraham Zhan

Reporter:: Abraham Zhan

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 16/May/16 13:42

Updated:: 19/Jan/18 20:36

Resolved:: 18/May/16 11:05