[SPARK-32040] Idle cores not being allocated


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Scheduler
    • Labels: None

    Description

      Background:

      I have a Spark 2.4.5 cluster running in standalone mode, orchestrated by
      Nomad jobs on EC2. We deploy a Scala web server as a long-running jar via
      spark-submit in client mode. Sometimes the application ends up with 0
      cores because our in-house autoscaler scales down and kills workers
      without checking whether any of a worker's cores are allocated to
      existing applications. The application is then left with 0 cores even
      though there are healthy workers in the cluster.

      However, only if I submit a new application or register a new worker in
      the cluster will the master finally reallocate cores to the application.
      This is problematic because the long-running, zero-core application is
      otherwise stuck indefinitely.
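
      Until this is fixed, that observation can be scripted as a stop-gap:
      registering (and immediately stopping) a throwaway application forces
      the master to run a scheduling pass, which hands the idle cores back to
      the stuck application. A minimal sketch (the object and app name are
      illustrative, not part of our actual deployment):

          import org.apache.spark.{SparkConf, SparkContext}

          // Hypothetical stop-gap: registering a no-op application with the
          // standalone master makes it re-run scheduling, so the idle cores
          // get reallocated to the stuck long-running application.
          object NudgeMaster {
            def main(args: Array[String]): Unit = {
              val conf = new SparkConf().setAppName("nudge-master")
              val sc = new SparkContext(conf)
              sc.stop() // registering is enough; exit right away
            }
          }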

      Could this be related to the fact that schedule() is only triggered by
      new workers / new applications, as the comment here notes?
      https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724
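
      For reference, the comment in question is the Scaladoc on schedule(),
      which in the linked v2.4.5 source reads:

          /**
           * Schedule the currently available resources among waiting apps. This method will be called
           * every time a new app joins or resource availability changes.
           */
          private def schedule(): Unit = {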

      If that is the case, should the master be calling schedule() after
      removing workers in timeOutWorkers()?
      https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417
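
      Concretely, the change I have in mind would look roughly like this in
      Master.receive (a sketch against 2.4.5, assuming the CheckForWorkerTimeOut
      message is the only path into timeOutWorkers(); not a tested patch):

          case CheckForWorkerTimeOut =>
            timeOutWorkers()
            // Proposed: re-run scheduling so that apps whose executors died
            // with the timed-out workers can be given cores on the remaining
            // healthy workers, instead of waiting for the next app or worker
            // registration to trigger schedule().
            schedule()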

      The downscaling produces the following in my logs, so I am fairly
      certain timeOutWorkers() is being called:

      20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
      20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
      20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
      20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
      20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
      20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-<IP_ADDRESS>-7077

People

    • Assignee: Unassigned
    • Reporter: t oo