[SPARK-26498] Integrate barrier execution with MMLSpark's LightGBM - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: ML, MLlib
Labels: None

Description

I would like to use the barrier execution mode introduced in Spark 2.4 with LightGBM in the Spark package MMLSpark, but I have run into some issues.

Currently, the LightGBM distributed learner tries to determine the number of cores on the cluster and then does a coalesce followed by a mapPartitions. Inside the mapPartitions we call NetworkInit (where the address:port of every worker needs to be passed in the constructor) and pass the data in memory to the native layer of the distributed LightGBM learner.
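As a rough illustration of the bookkeeping described above, here is a minimal Python sketch. It is not the MMLSpark implementation: `target_partitions` and `build_machine_list` are hypothetical helper names, and the sketch assumes one partition per core and a comma-separated `host:port` list as the format handed to the native network setup.

```python
from typing import List, Tuple


def target_partitions(num_executors: int, cores_per_executor: int,
                      current_partitions: int) -> int:
    """Partition count to coalesce to: one per core, never more than exist.

    coalesce() without a shuffle cannot increase the partition count,
    hence the min().
    """
    total_cores = num_executors * cores_per_executor
    return min(total_cores, current_partitions)


def build_machine_list(workers: List[Tuple[str, int]]) -> str:
    """Join (host, port) pairs into the 'host1:port1,host2:port2' form."""
    return ",".join(f"{host}:{port}" for host, port in workers)
```

Under these assumptions, the driver would coalesce the data to `target_partitions(...)` partitions, and each task would pass the string from `build_machine_list(...)` to the network initialization before handing its rows to the native learner.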

With barrier execution mode, I think the code would become much more robust. However, I am running into several issues when trying to move my code over to the barrier execution mode scheduler:

It does not support dynamic allocation. I think it would be convenient if it restarted the job when the number of workers decreases, and let the developer decide whether to restart the job when the number of workers increases.

It does not work with the DataFrame or Dataset API, and it would be much more convenient if it did.

How does barrier execution mode deal with #partitions > #tasks? If the number of partitions is larger than the number of tasks (workers), can barrier execution mode automatically coalesce the dataset so that #partitions == #tasks?
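For reference, a minimal PySpark sketch of what the barrier-mode version might look like, with the coalesce done by hand because the question above is whether it could be automatic. This is an assumption-laden illustration, not the MMLSpark code: `machines_from_addresses`, `barrier_train`, `num_tasks`, and `base_port` are hypothetical names, and the call into the native learner is left as a comment. It goes through the RDD API (`rdd.barrier()`), which is where barrier mode is exposed.

```python
from typing import List


def machines_from_addresses(addresses: List[str], base_port: int) -> str:
    """Turn executor 'host:port' addresses into a LightGBM-style machine list.

    The executor port is dropped and each task gets base_port + index, so
    two tasks on the same host do not collide (an assumed port scheme).
    """
    hosts = [addr.rsplit(":", 1)[0] for addr in addresses]
    return ",".join(f"{host}:{base_port + i}" for i, host in enumerate(hosts))


def barrier_train(rdd, num_tasks: int, base_port: int = 12400):
    """Run one barrier task per partition and return (partitionId, machines)."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark import BarrierTaskContext

    def train_partition(rows):
        ctx = BarrierTaskContext.get()
        # One BarrierTaskInfo per task in the stage; .address is "host:port".
        addresses = [info.address for info in ctx.getTaskInfos()]
        machines = machines_from_addresses(addresses, base_port)
        ctx.barrier()  # wait until every task in the stage reaches this point
        # Hypothetical: pass `machines` and `rows` to the native learner here.
        yield (ctx.partitionId(), machines)

    # Coalesce explicitly so that #partitions == #tasks before the barrier stage.
    return rdd.coalesce(num_tasks).barrier().mapPartitions(train_partition).collect()
```

With a DataFrame `df`, this would be invoked as `barrier_train(df.rdd, num_tasks)`, which drops to the RDD API and so illustrates the second issue above.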

People

Assignee: Unassigned
Reporter: Ilya Matiach
Votes: 0
Watchers: 5

Dates

Created: 28/Dec/18 22:36
Updated: 17/Mar/20 09:46