[SPARK-32040] Idle cores not being allocated


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Scheduler
    • Labels: None

    Description

      Background:

      I have a Spark 2.4.5 cluster running in standalone mode, orchestrated by
      Nomad jobs on EC2. We deploy a Scala web server as a long-running jar via
      spark-submit in client mode. Sometimes the application ends up with 0
      cores because our in-house autoscaler scales down and kills workers
      without checking whether any of a worker's cores are allocated to
      existing applications. The application is then left with 0 cores even
      though there are healthy workers in the cluster.

      However, only if I submit a new application or register a new worker in
      the cluster will the master finally reallocate cores to the application.
      This is problematic because the long-running, zero-core application is
      otherwise stuck indefinitely.
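
      Until this is fixed, that observation can be scripted as a stop-gap:
      registering (and immediately stopping) a throwaway application forces
      the master to run a scheduling pass, which hands the idle cores back to
      the stuck application. A minimal sketch (the object and app name are
      illustrative, not part of our actual deployment):

          import org.apache.spark.{SparkConf, SparkContext}

          // Hypothetical stop-gap: registering a no-op application with the
          // standalone master makes it re-run scheduling, so the idle cores
          // get reallocated to the stuck long-running application.
          object NudgeMaster {
            def main(args: Array[String]): Unit = {
              val conf = new SparkConf().setAppName("nudge-master")
              val sc = new SparkContext(conf)
              sc.stop() // registering is enough; exit right away
            }
          }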

      Could this be related to the fact that schedule() is only triggered by
      new workers / new applications, as the comment here notes?
      https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724
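
      For reference, the comment in question is the Scaladoc on schedule(),
      which in the linked v2.4.5 source reads:

          /**
           * Schedule the currently available resources among waiting apps. This method will be called
           * every time a new app joins or resource availability changes.
           */
          private def schedule(): Unit = {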

      If that is the case, should the master be calling schedule() after
      removing workers in timeOutWorkers()?
      https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417
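
      Concretely, the change I have in mind would look roughly like this in
      Master.receive (a sketch against 2.4.5, assuming the CheckForWorkerTimeOut
      message is the only path into timeOutWorkers(); not a tested patch):

          case CheckForWorkerTimeOut =>
            timeOutWorkers()
            // Proposed: re-run scheduling so that apps whose executors died
            // with the timed-out workers can be given cores on the remaining
            // healthy workers, instead of waiting for the next app or worker
            // registration to trigger schedule().
            schedule()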

      The downscaling produces the following in my logs, so I am fairly
      certain timeOutWorkers() is being called:

      20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
      20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
      20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
      20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
      20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
      20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-<IP_ADDRESS>-7077

People

    • Assignee: Unassigned
    • Reporter: t oo