(The JIRA received a major update on 2019/02/28. Some comments were based on an earlier version. Please ignore them. New comments start at comment-16778026.)
GPUs and other accelerators are widely used to speed up specialized workloads, e.g., deep learning and signal processing. While users from the AI community use GPUs heavily, they often need Apache Spark to load and process large datasets and to handle complex data scenarios like streaming. YARN and Kubernetes already support GPUs in their recent releases. Although Spark supports those two cluster managers, Spark itself is not aware of the GPUs they expose, and hence it cannot properly request GPUs and schedule them for users. This leaves a critical gap in unifying big data and AI workloads and making life simpler for end users.
To make Spark aware of GPUs, we need to make two major changes at a high level:
- At the cluster manager level, we update or upgrade cluster managers to include GPU support. Then we expose user interfaces for Spark to request GPUs from them.
- Within Spark, we update its scheduler to understand the GPUs allocated to executors and the GPU requests of user tasks, and to assign GPUs to tasks properly.
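The scheduler-side change can be illustrated with a minimal sketch. It is not Spark's scheduler code: the `Executor` record and the `assign_gpus` and `release_gpus` helpers are hypothetical names introduced only for illustration. It assumes each executor reports the indices of the GPUs allocated to it and each task asks for a whole number of GPUs, so a card is never split between tasks.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Executor:
    """Hypothetical executor record: the GPU indices it still has free."""
    executor_id: str
    free_gpus: List[int] = field(default_factory=list)


def assign_gpus(executors: List[Executor],
                gpus_per_task: int) -> Optional[Tuple[str, List[int]]]:
    """Give a task whole GPUs from the first executor that has enough free.

    Returns (executor_id, granted GPU indices), or None if no executor can
    satisfy the request right now and the task has to wait.
    """
    if gpus_per_task < 1:
        raise ValueError("a task must request a whole number of GPUs (>= 1)")
    for executor in executors:
        if len(executor.free_gpus) >= gpus_per_task:
            granted = executor.free_gpus[:gpus_per_task]
            # Granted cards leave the free pool until the task finishes.
            executor.free_gpus = executor.free_gpus[gpus_per_task:]
            return executor.executor_id, granted
    return None


def release_gpus(executor: Executor, gpus: List[int]) -> None:
    """Return a finished task's GPUs to the executor's free pool."""
    executor.free_gpus = sorted(executor.free_gpus + gpus)


if __name__ == "__main__":
    cluster = [Executor("exec-1", [0]), Executor("exec-2", [0, 1])]
    print(assign_gpus(cluster, 2))  # exec-1 has only one GPU, so exec-2 is used
    print(assign_gpus(cluster, 1))
    print(assign_gpus(cluster, 1))  # nothing left: the task would have to wait
```

A first-fit policy is used here only to keep the sketch short; a real scheduler would also have to weigh CPU slots, locality, and the performance of jobs that request no GPUs at all.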
Based on the work done in YARN and Kubernetes to support GPUs and on some offline prototypes, we could have the necessary features implemented in the next major release of Spark. You can find a detailed scoping doc here, where we listed user stories and their priorities.
Goals:
- Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
- No regression on scheduler performance for normal jobs.
Non-goals:
- Fine-grained scheduling within one GPU card.
  - We treat one GPU card and its memory together as a non-divisible unit.
- Support TPU.
- Support Mesos.
- Support Windows.
Target personas:
- Admins who need to configure clusters to run Spark with GPU nodes.
- Data scientists who need to build DL applications on Spark.
- Developers who need to integrate DL features on Spark.