  1. Mesos
  2. MESOS-9767

Add self health monitoring in Mesos master

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: master
    • Labels: None

    Description

      We have seen an issue where the Mesos master got stuck and stopped responding to HTTP endpoints such as "/metrics/snapshot". This causes calls from frameworks and the metrics collector to the master to hang. Currently we emit a 'master alive' metric using Prometheus. If the master hangs, this metric is not published, and we detect the hang through alerts built on top of this metric. By the time someone receives the alert and restarts the master process, 15-30 minutes have passed. This results in SLA violations for Mesos cluster users.

      It would be nice to implement self health check monitoring to detect whether the Mesos master is hung or stuck. This would let us quickly crash the master process so that one of the other members of the quorum can acquire the ZK leadership lock.

      We can use the "/master/health" endpoint for health checks. 
      Health checks can be initiated in [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp] just after the child master process is spawned.

      We can leverage the [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp] for this. One downside is that HealthChecker currently takes a TaskID as input, which is not meaningful for a master health check.

      We can add the following flags to control self health checking:

      1. self_monitoring_enabled: Whether self monitoring is enabled.
      2. self_monitoring_consecutive_failures: Number of consecutive health check failures after which the master is crashed.
      3. self_monitoring_interval_secs: Interval, in seconds, at which health checks are performed.
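
To make the intended semantics of the three flags concrete, here is a minimal, standalone C++ sketch of the consecutive-failure logic. It is illustrative only and does not use the Mesos HealthChecker or libprocess: the names SelfMonitoringFlags and SelfMonitor and the default values are assumptions made for this example, and the actual probe of "/master/health" is left to the caller.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical container for the proposed flags. These are not existing
// Mesos flags, and the defaults are placeholders chosen for this sketch.
struct SelfMonitoringFlags {
  bool enabled = false;               // self_monitoring_enabled
  unsigned consecutiveFailures = 3;   // self_monitoring_consecutive_failures
  std::chrono::seconds interval{10};  // self_monitoring_interval_secs
};

// Tracks consecutive health check failures. The caller is expected to probe
// the health endpoint once every `flags.interval` and report each result.
class SelfMonitor {
public:
  explicit SelfMonitor(const SelfMonitoringFlags& flags) : flags_(flags) {
    // A threshold of zero would crash the master before any check ran.
    assert(flags_.consecutiveFailures > 0);
  }

  // Records one health check result. Returns true once the number of
  // consecutive failures reaches the configured threshold, meaning the
  // master should be crashed. Always returns false when disabled.
  bool record(bool healthy) {
    if (!flags_.enabled) {
      return false;
    }
    // A single healthy response resets the failure streak.
    failures_ = healthy ? 0 : failures_ + 1;
    return failures_ >= flags_.consecutiveFailures;
  }

  unsigned failures() const { return failures_; }

private:
  SelfMonitoringFlags flags_;
  unsigned failures_ = 0;
};
```

In a real implementation the check would have to run outside whatever is stuck in the master, and a true result from record() would be followed by terminating the process (for example with std::abort()) so that another quorum member can take over leadership. Resetting the count on any healthy response keeps a single slow reply from crashing an otherwise healthy master.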

          People

            Unassigned
            Gaurav Garg (ggarg)
            Vinod Kone