Mesos / MESOS-5378

Terminating a framework during master failover leads to orphaned tasks


Details


    Description

      Repro steps:

      1) Setup:

      bin/mesos-master.sh --work_dir=/tmp/master
      bin/mesos-slave.sh --work_dir=/tmp/slave --master=localhost:5050
      src/mesos-execute --checkpoint --command="sleep 1000" --master=localhost:5050 --name="test"
      

      2) Kill all three from (1), in the order they were started.

      3) Restart the master and agent. Do not restart the framework.

      Result)

      • The agent will reconnect and continue running the orphaned task.
      • The Web UI will report no memory usage.
      • curl localhost:5050/metrics/snapshot will report: "master/mem_used": 128,
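The mismatch above can be made concrete by cross-checking the metrics snapshot against the framework list. A minimal sketch, using illustrative JSON payloads shaped like the master's `/metrics/snapshot` and `/state` responses (field names are assumptions from this era of Mesos; the values are made up):

```python
import json

# Illustrative payloads; in practice these would come from
#   curl localhost:5050/metrics/snapshot
#   curl localhost:5050/state
snapshot = json.loads('{"master/mem_used": 128}')
state = json.loads('{"frameworks": [], "orphan_tasks": [{"name": "test"}]}')

# Memory the master still accounts for...
accounted = snapshot["master/mem_used"]

# ...versus memory attributable to registered frameworks (none remain).
registered = sum(
    task.get("resources", {}).get("mem", 0)
    for framework in state["frameworks"]
    for task in framework.get("tasks", [])
)

print(accounted, registered, len(state["orphan_tasks"]))  # 128 0 1
```

The 128 MB charged to `master/mem_used` belongs to a task no registered framework owns, which is why the Web UI shows no usage.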

      Cause)
      When a framework registers with the master, it provides a failover_timeout, which tells the master how long to wait for a disconnected framework to re-register. If the framework disconnects and does not reconnect within this failover_timeout, the master will kill all tasks belonging to the framework.
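The timeout behavior just described can be sketched as a toy simulation (the `Master` class and its fields are illustrative, not the actual Mesos implementation):

```python
class Master:
    """Toy model of the master's failover_timeout handling (not real Mesos)."""

    def __init__(self):
        self.timeouts = {}      # framework_id -> failover_timeout (seconds)
        self.tasks = {}         # framework_id -> list of task names
        self.disconnected = {}  # framework_id -> time of disconnect

    def register(self, fid, failover_timeout):
        self.timeouts[fid] = failover_timeout
        self.tasks.setdefault(fid, [])

    def disconnect(self, fid, now):
        # Start the failover clock; tasks keep running for now.
        self.disconnected[fid] = now

    def tick(self, now):
        # Kill all tasks of any framework whose failover_timeout expired.
        for fid, t0 in list(self.disconnected.items()):
            if now - t0 >= self.timeouts[fid]:
                self.tasks[fid] = []  # tasks are killed
                del self.disconnected[fid]

master = Master()
master.register("test-framework", failover_timeout=60)
master.tasks["test-framework"].append("sleep 1000")

master.disconnect("test-framework", now=0)
master.tick(now=30)  # within the timeout: task survives
master.tick(now=61)  # timeout expired: task is killed
print(master.tasks["test-framework"])  # []
```

Note that the timer state here lives only in memory, which is exactly what the bug below exploits.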

      However, the master does not persist this failover_timeout across master failover. The master will therefore "forget" about a framework if both of the following happen:
      1) The master dies before the failover_timeout expires.
      2) The framework dies while the master is dead.

      When the master comes back up, the agent re-registers and reports the orphaned task(s). Because the master failed over, it cannot tell that these tasks are orphans (i.e. it assumes the frameworks might still re-register).
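The failed-over master's dilemma can be sketched as follows (a toy model under assumed names, not the actual Mesos recovery code): the new master's in-memory framework state is empty, so tasks reported by a re-registering agent cannot be tied to any failover_timeout.

```python
# Toy model of a master that has just failed over: the in-memory
# framework map is empty because failover_timeout was never persisted.
failed_over_master = {
    "frameworks": {},    # lost across failover
    "orphan_tasks": [],  # tasks the master cannot attribute
}

def agent_reregister(master, reported_tasks):
    """Agent re-registration: report still-running tasks to the master."""
    for task in reported_tasks:
        fid = task["framework_id"]
        if fid not in master["frameworks"]:
            # Unknown framework: the master assumes it may still
            # re-register, so the task is kept as an orphan and no
            # failover timer is (or can be) started.
            master["orphan_tasks"].append(task)

agent_reregister(failed_over_master,
                 [{"name": "test", "framework_id": "fw-1"}])
print(failed_over_master["orphan_tasks"])  # the task lingers indefinitely
```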

      Proposed solution)
      The master should save the FrameworkID and failover_timeout in the registry. Upon recovery, the master should resume the failover_timeout timers.
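A sketch of how the proposed fix could look (an assumption about the shape of the change, not actual Mesos registry code): write each framework's failover_timeout through to the registry, and on recovery restart a timer for every persisted framework that has not re-registered.

```python
registry = {}  # persisted store: framework_id -> failover_timeout

def register_framework(fid, failover_timeout):
    # Persist the timeout alongside the FrameworkID at registration time.
    registry[fid] = failover_timeout

def recover(reregistered, now):
    """On master recovery, resume failover timers for every persisted
    framework that has not re-registered yet."""
    timers = {}
    for fid, timeout in registry.items():
        if fid not in reregistered:
            timers[fid] = now + timeout  # deadline to kill its tasks
    return timers

register_framework("fw-1", failover_timeout=60)

# After failover: the framework did not come back, but the registry
# remembers it, so a timer can be resumed instead of orphaning tasks.
timers = recover(reregistered=set(), now=100)
print(timers)  # {'fw-1': 160}
```

With this in place, the scenario from the repro steps would end with the orphaned task being killed once the resumed timer fires, rather than lingering forever.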


People

  Assignee: Unassigned
  Reporter: Joseph Wu (kaysoky)
