  1. Mesos
  2. MESOS-7057

Consider using the relink functionality of libprocess in the executor driver.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.0
    • Fix Version/s: 1.1.2, 1.2.0
    • None
    • Sprint: Mesosphere Sprint 51
    • 2

    Description

      As outlined in the root cause analysis for MESOS-5332, an iptables firewall can terminate an idle connection after a timeout (the default is 5 days). When this happens, the executor driver is not notified of the disconnection and continues to assume that it is still connected to the agent.

      When the agent process is restarted, the executor still tries to reuse the old, broken connection to send the re-register message to the agent. Only at this point does it realize that the connection is broken (due to the nature of TCP, the failure surfaces only once data is written to the socket). It then invokes the exited callback and terminates itself 15 minutes later, when the recovery timeout expires.

      To mitigate this, an executor should always relink when it receives a reconnect request from the agent.
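
      The failure mode and the proposed fix can be illustrated with a small simulation. This is only a sketch under stated assumptions: `Connection`, `Executor`, `on_reconnect_request`, and `simulate` are hypothetical names invented for illustration and do not correspond to the Mesos executor driver or the libprocess API. The sketch models a connection that a firewall has dropped silently, and compares reusing it against relinking before the re-register message is sent.

```python
from typing import List


class Connection:
    """Hypothetical stand-in for a persistent socket to the agent."""

    def __init__(self) -> None:
        # Set silently by the "firewall"; the owner is never notified.
        self.dropped_by_firewall = False

    def send(self, message: str, inbox: List[str]) -> bool:
        # A silently dropped connection only reveals itself on a write.
        if self.dropped_by_firewall:
            return False
        inbox.append(message)
        return True


class Executor:
    """Toy model of an executor driver; not the Mesos implementation."""

    def __init__(self) -> None:
        self.connection = Connection()
        self.exited = False

    def on_reconnect_request(self, agent_inbox: List[str], relink: bool) -> None:
        if relink:
            # Discard the possibly stale socket and open a fresh one.
            self.connection = Connection()
        if not self.connection.send("re-register", agent_inbox):
            # Models the exited callback firing on the broken link.
            self.exited = True


def simulate(relink: bool) -> bool:
    """Return True if the re-register message reaches the restarted agent."""
    executor = Executor()
    executor.connection.dropped_by_firewall = True  # idle timeout elapsed
    agent_inbox: List[str] = []  # inbox of the restarted agent
    executor.on_reconnect_request(agent_inbox, relink=relink)
    return agent_inbox == ["re-register"] and not executor.exited


if __name__ == "__main__":
    print("reuse old link:", simulate(relink=False))
    print("relink first:  ", simulate(relink=True))
```

      In this toy model, reusing the stale connection loses the re-register message and triggers the exit path, while relinking first lets the message through. The real fix would rely on the relink functionality of libprocess rather than on replacing an object as done here.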

            People

              Assignee: Anand Mazumdar
              Reporter: Anand Mazumdar
              Shepherd: Vinod Kone