[SPARK-29861] Reduce downtime in Spark standalone HA master switch - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

As stated in the official Spark HA documentation, recovery of a Spark standalone master in HA mode with ZooKeeper takes about 1-2 minutes. During this time no Spark master is active, which makes interaction with Spark essentially impossible.

After looking for a way to reduce this downtime, it seems that it is mainly caused by the leader election, which waits for open ZooKeeper connections to be closed. This downtime seems unnecessary, for example in the case of a planned VM update.

I have fixed this in my setup by:

1. Closing open ZooKeeper connections during Spark shutdown.
2. Bumping the Curator version and implementing a custom error policy that tolerates a ZooKeeper connection suspension.
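
To illustrate the idea behind the second item, here is a minimal, self-contained sketch of what a suspension-tolerant error policy could look like. It does not use Curator itself: `ConnState`, `ErrorPolicy`, `StrictPolicy`, `SuspensionTolerantPolicy` and `PolicyDemo` are hypothetical stand-ins that only mirror the general shape of Curator's `ConnectionState` and `ConnectionStateErrorPolicy`. It is an assumption about the approach, not the code from the pull request.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for Curator's ConnectionState
// (only the states relevant to this sketch are modelled).
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical stand-in for Curator's ConnectionStateErrorPolicy.
interface ErrorPolicy {
    boolean isErrorState(ConnState state);
}

// Strict behaviour: a suspended connection already counts as an error,
// so leadership would be given up as soon as the connection flickers.
final class StrictPolicy implements ErrorPolicy {
    private static final Set<ConnState> ERRORS =
        EnumSet.of(ConnState.SUSPENDED, ConnState.LOST);

    @Override
    public boolean isErrorState(ConnState state) {
        return ERRORS.contains(state);
    }
}

// Tolerant behaviour: only a lost connection is treated as an error,
// so a short suspension does not trigger a new leader election.
final class SuspensionTolerantPolicy implements ErrorPolicy {
    @Override
    public boolean isErrorState(ConnState state) {
        return state == ConnState.LOST;
    }
}

final class PolicyDemo {
    // Returns true if no state in the sequence is an error under the policy,
    // i.e. the (simulated) leader would keep its leadership throughout.
    static boolean keepsLeadership(ErrorPolicy policy, ConnState... events) {
        for (ConnState event : events) {
            if (policy.isErrorState(event)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A brief connection blip: suspended, then reconnected.
        ConnState[] blip = {
            ConnState.CONNECTED, ConnState.SUSPENDED, ConnState.RECONNECTED
        };
        System.out.println("strict keeps leadership:   "
            + keepsLeadership(new StrictPolicy(), blip));
        System.out.println("tolerant keeps leadership: "
            + keepsLeadership(new SuspensionTolerantPolicy(), blip));
    }
}
```

In a real setup the policy would presumably be registered with the Curator client rather than evaluated by hand as in `keepsLeadership`; how that wiring looks in Spark is left to the pull request.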

I am preparing a pull request for review / further discussion on this issue.

Issue Links

links to

GitHub Pull Request #26598

People

Assignee: Unassigned

Reporter: Robin Wolters

Votes: 0

Watchers: 1

Dates

Created: 12/Nov/19 15:45

Updated: 29/Mar/20 00:11