SPARK-29861

Reduce downtime in Spark standalone HA master switch

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      As officially stated in the Spark HA documentation, the recovery process of a Spark standalone master in HA mode with ZooKeeper takes about 1-2 minutes. During this time no Spark master is active, which makes interaction with Spark essentially impossible.

      After looking for a way to reduce this downtime, it seems to be caused mainly by the leader election, which waits for open ZooKeeper connections to be closed. This downtime seems unnecessary, for example in the case of a planned VM update.

      I have fixed this in my setup by:

      1. Closing open ZooKeeper connections during Spark shutdown
      2. Bumping the Curator version and implementing a custom error policy that tolerates a ZooKeeper connection suspension.
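
To illustrate the second point, here is a minimal, self-contained sketch of what a suspension-tolerant error policy could look like. It is not the code from the planned pull request. The types `ConnState`, `ErrorPolicy`, `StrictPolicy`, `SuspensionTolerantPolicy` and `MasterModel` are hypothetical stand-ins defined locally so that the example compiles without Curator on the classpath. They are modelled on Curator's `ConnectionState` enum and its `ConnectionStateErrorPolicy` interface (method `isErrorState`), which as far as I know are only available from Curator 3.x onwards, hence the version bump.

```java
// Hypothetical local copy of the values in Curator's ConnectionState enum,
// so the sketch compiles without the Curator dependency.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST, READ_ONLY }

// Hypothetical stand-in with the same shape as Curator's
// ConnectionStateErrorPolicy interface.
interface ErrorPolicy {
    boolean isErrorState(ConnState state);
}

// Strict behaviour: a suspended connection and a lost session are both errors.
final class StrictPolicy implements ErrorPolicy {
    @Override
    public boolean isErrorState(ConnState state) {
        return state == ConnState.SUSPENDED || state == ConnState.LOST;
    }
}

// Tolerant behaviour: only a lost session counts as an error, so a short
// connection suspension does not cost the master its leadership.
final class SuspensionTolerantPolicy implements ErrorPolicy {
    @Override
    public boolean isErrorState(ConnState state) {
        return state == ConnState.LOST;
    }
}

// Toy model of a leader-election participant (assumption: leadership is
// given up as soon as the policy reports an error state).
final class MasterModel {
    private final ErrorPolicy policy;
    private boolean leader = true;

    MasterModel(ErrorPolicy policy) {
        this.policy = policy;
    }

    void onConnectionStateChanged(ConnState state) {
        if (policy.isErrorState(state)) {
            leader = false;
        }
    }

    boolean isLeader() {
        return leader;
    }
}
```

In a real setup the policy would implement Curator's own interface and, assuming the builder API I remember, be registered through `connectionStateErrorPolicy(...)` on the `CuratorFrameworkFactory` builder. The first point (closing ZooKeeper connections on shutdown) is independent of this and would amount to closing the Curator client before the master process exits.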

      I am preparing a pull request for review / further discussion on this issue.

            People

              Assignee: Unassigned
              Reporter: Robin Wolters (Nibooor)
              Votes: 0
              Watchers: 1