Mesos / MESOS-5576

Masters may drop the first message they send between masters after a network partition


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.28.2
    • Fix Version/s: 0.28.3, 1.0.0
    • Environment: Observed in an OpenStack environment where each master lives on a separate VM.
    • Sprint: Mesosphere Sprint 38
    • Story Points: 5

    Description

      We observed the following situation in a cluster of five masters:

      Time | Master 1                                 | Master 2             | Master 3          | Master 4          | Master 5
      0    | Follower                                 | Follower             | Follower          | Follower          | Leader
      1    | Follower                                 | Follower             | Follower          | Follower          | Partitioned from the cluster by downing this VM's network
      2    | Elected Leader by ZK                     | Voting               | Voting            | Voting            | Suicides due to lost leadership
      3    | Performs consensus                       | Replies to leader    | Replies to leader | Replies to leader | Still down
      4    | Performs writing                         | Acks to leader       | Acks to leader    | Acks to leader    | Still down
      5    | Leader                                   | Follower             | Follower          | Follower          | Still down
      6    | Leader                                   | Follower             | Follower          | Follower          | Comes back up
      7    | Leader                                   | Follower             | Follower          | Follower          | Follower
      8    | Partitioned in the same way as Master 5  | Follower             | Follower          | Follower          | Follower
      9    | Suicides due to lost leadership          | Elected Leader by ZK | Follower          | Follower          | Follower
      10   | Still down                               | Performs consensus   | Replies to leader | Replies to leader | Doesn't get the message!
      11   | Still down                               | Performs writing     | Acks to leader    | Acks to leader    | Acks to leader
      12   | Still down                               | Leader               | Follower          | Follower          | Follower

      Master 2 sends a series of messages to the recently-restarted Master 5. The first message is dropped, but subsequent messages are not dropped.

      This appears to be due to a stale link between the masters. Before leader election, the replicated log actors create a network watcher, which adds links to masters that join the ZK group:
      https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
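
      As a rough illustration only (not the actual code at the link above), the watcher's behavior amounts to linking to each master PID that appears in the group; the MembershipWatcher name and membership callback below are assumptions made for the sketch:

      // Hedged sketch of a libprocess actor that links to peers as they join
      // the ZK group. MembershipWatcher and onMembershipChange are illustrative
      // names, not the classes in src/log/network.hpp.
      #include <set>

      #include <process/pid.hpp>
      #include <process/process.hpp>

      using process::UPID;

      class MembershipWatcher : public process::Process<MembershipWatcher>
      {
      public:
        // Invoked whenever the ZooKeeper group membership changes.
        void onMembershipChange(const std::set<UPID>& members)
        {
          for (const UPID& pid : members) {
            if (linked.count(pid) == 0) {
              // Establish a persistent link to the new member. By default,
              // libprocess reuses any existing socket to this PID for later sends.
              link(pid);
              linked.insert(pid);
            }
          }
        }

      private:
        std::set<UPID> linked;
      };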

      This link (Master 2 -> Master 5) does not appear to break when Master 5 goes down, perhaps because of how the network partition was induced (at the hypervisor layer, rather than inside the VM itself).

      When Master 2 tries to send a PromiseRequest to Master 5, we do not observe the expected log message.

      Instead, we see a log line in Master 2:

      process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is not connected
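
      For context, "Transport endpoint is not connected" is strerror(ENOTCONN), which shutdown(2) reports when the socket has no live peer. A minimal standalone sketch (plain POSIX, not Mesos code) that produces the same message:

      #include <cerrno>
      #include <cstdio>
      #include <cstring>

      #include <sys/socket.h>
      #include <unistd.h>

      int main()
      {
        // A TCP socket that was never connected (or whose connection is gone)
        // cannot be shut down; shutdown() fails with ENOTCONN.
        int fd = ::socket(AF_INET, SOCK_STREAM, 0);

        if (::shutdown(fd, SHUT_RDWR) < 0) {
          std::printf("Failed to shutdown socket with fd %d: %s\n",
                      fd, std::strerror(errno));
        }

        ::close(fd);
        return 0;
      }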
      

      The broken link is then removed by the libprocess socket_manager, and the subsequent WriteRequest from Master 2 to Master 5 succeeds over a new socket.
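
      One way to avoid reusing the stale connection (a sketch only, not necessarily how the fix was implemented) is for the sender to force the link to be re-established before the first message after a membership change. This assumes a link() variant that replaces any existing socket, such as libprocess's RemoteConnection::RECONNECT option; treat the exact signature as an assumption.

      // Hedged sketch: relink to a replica before sending so that a half-dead
      // socket left over from a network partition is not reused.
      #include <process/pid.hpp>
      #include <process/process.hpp>

      using process::UPID;

      class CoordinatorSketch : public process::Process<CoordinatorSketch>
      {
      public:
        void contact(const UPID& replica)
        {
          // RECONNECT (assumed here) closes any existing socket to `replica`
          // and opens a fresh one instead of reusing the old link.
          link(replica, process::RemoteConnection::RECONNECT);

          // send(replica, ...);  // the PromiseRequest would be sent here.
        }
      };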

      People

        Assignee: Joseph Wu (kaysoky)
        Reporter: Joseph Wu (kaysoky)
        Shepherd: Benjamin Mahler