[SPARK-26104] make pci devices visible to task scheduler - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: Spark Core
Labels:
- Hydrogen

Description

Spark task scheduling has long considered CPU only: depending on how many vcores each executor has at a given moment, tasks are scheduled as soon as enough vcores become available.
With the move to deep learning use cases, the fundamental computation and processing unit has shifted from the CPU to GPU/FPGA plus a CPU that moves data in and out of GPU memory.

Deep learning frameworks built on top of GPU fleets require pinning a task to a fixed number of GPUs, which Spark does not support yet. E.g., a Horovod task requires 2 GPUs running uninterrupted until it finishes, regardless of CPU availability in the executor. In the Uber Peloton executor scheduler, the number of cores available could be more than what the user asked for, because the executor might be over-provisioned.

Without exclusive occupancy of PCI devices (/gpu1, /gpu2), such workloads may run into unexpected states.
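
To make the requirement concrete, here is a minimal Python sketch of exclusive device bookkeeping on a single executor. It is not Spark code and does not reflect the linked pull request: `ExecutorDevices`, `try_claim`, and `release` are hypothetical names chosen for illustration, and the device paths simply reuse the `/gpu1`, `/gpu2` notation above.

```python
from typing import Dict, List, Optional


class ExecutorDevices:
    """Tracks which task exclusively holds each device on one executor (hypothetical)."""

    def __init__(self, devices: List[str]) -> None:
        # Map each device path to the id of the owning task, or None when free.
        self._owner: Dict[str, Optional[int]] = {d: None for d in devices}

    def free_devices(self) -> List[str]:
        return [d for d, owner in self._owner.items() if owner is None]

    def try_claim(self, task_id: int, count: int) -> Optional[List[str]]:
        """Claim `count` free devices for `task_id`, all or nothing."""
        free = self.free_devices()
        if len(free) < count:
            return None  # not enough devices; the task has to wait
        claimed = free[:count]
        for d in claimed:
            self._owner[d] = task_id
        return claimed

    def release(self, task_id: int) -> None:
        """Free every device held by `task_id`, e.g. when the task finishes."""
        for d, owner in self._owner.items():
            if owner == task_id:
                self._owner[d] = None


executor = ExecutorDevices(["/gpu1", "/gpu2"])
first = executor.try_claim(task_id=1, count=2)   # task 1 takes both GPUs
second = executor.try_claim(task_id=2, count=1)  # None: both GPUs are held by task 1
executor.release(task_id=1)
third = executor.try_claim(task_id=2, count=1)   # succeeds once task 1 has released
```

The all-or-nothing claim is the point of the sketch: a task that needs 2 GPUs either gets both for its whole lifetime or does not start, independent of how many vcores happen to be free.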

Related JIRAs cover allocating executor containers with GPU resources, which serves the bootstrap phase:
SPARK-19320 (Mesos), ~~SPARK-24491~~ (K8s), ~~SPARK-20327~~ (YARN)

The existing SPIP, Accelerator Aware Task Scheduling For Spark (~~SPARK-24615~~), is compatible with this design. The approach here is a bit different: it tracks utilization of PCI devices, so a customized TaskScheduler could either fall back to a "best to have" approach or implement the "must have" approach stated above.
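
The difference between the two approaches can be sketched as a small decision function. This is an illustration only, written in Python rather than against Spark's scheduler interfaces: `DevicePolicy`, `Offer`, `TaskRequest`, and `decide` are hypothetical names, and reading "best to have" as "launch on CPU only when GPUs are short" is an assumption, not something the ticket spells out.

```python
from dataclasses import dataclass
from enum import Enum


class DevicePolicy(Enum):
    MUST_HAVE = "must_have"        # never launch without the requested devices
    BEST_TO_HAVE = "best_to_have"  # prefer devices, fall back to CPU only (assumed meaning)


@dataclass
class Offer:
    free_vcores: int
    free_gpus: int


@dataclass
class TaskRequest:
    vcores: int
    gpus: int
    policy: DevicePolicy


def decide(task: TaskRequest, offer: Offer) -> str:
    """Return 'launch_with_gpus', 'launch_cpu_only', or 'wait'."""
    if offer.free_vcores < task.vcores:
        return "wait"  # the existing CPU check still applies
    if offer.free_gpus >= task.gpus:
        return "launch_with_gpus"
    if task.policy is DevicePolicy.BEST_TO_HAVE:
        return "launch_cpu_only"
    return "wait"  # MUST_HAVE: spare vcores do not compensate for missing GPUs


# An over-provisioned executor: plenty of vcores, but only one free GPU.
offer = Offer(free_vcores=8, free_gpus=1)
strict = decide(TaskRequest(vcores=1, gpus=2, policy=DevicePolicy.MUST_HAVE), offer)
relaxed = decide(TaskRequest(vcores=1, gpus=2, policy=DevicePolicy.BEST_TO_HAVE), offer)
```

With this offer, the "must have" request waits while the "best to have" request proceeds without GPUs, which is the behavioral split a custom scheduler would choose between.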

Issue Links

links to

[Github] Pull Request #23073 (chenqin)

People

Assignee: Unassigned

Reporter: Chen Qin

Votes: 0

Watchers: 3

Dates

Created: 17/Nov/18 22:50

Updated: 16/Mar/20 22:53