Mesos / MESOS-5378

Terminating a framework during master failover leads to orphaned tasks


Details


    Description

      Repro steps:

      1) Setup:

      bin/mesos-master.sh --work_dir=/tmp/master
      bin/mesos-slave.sh --work_dir=/tmp/slave --master=localhost:5050
      src/mesos-execute --checkpoint --command="sleep 1000" --master=localhost:5050 --name="test"
      

      2) Kill all three from (1), in the order they were started.

      3) Restart the master and agent. Do not restart the framework.

      Result)

      • The agent will reconnect and continue running the orphaned task.
      • The Web UI will report no memory usage.
      • curl localhost:5050/metrics/snapshot will report: "master/mem_used": 128,
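The mismatch above can be made concrete by cross-checking the metrics snapshot against the framework list. A minimal sketch, using illustrative JSON payloads shaped like the master's `/metrics/snapshot` and `/state` responses (field names are assumptions from this era of Mesos; the values are made up):

```python
import json

# Illustrative payloads; in practice these would come from
#   curl localhost:5050/metrics/snapshot
#   curl localhost:5050/state
snapshot = json.loads('{"master/mem_used": 128}')
state = json.loads('{"frameworks": [], "orphan_tasks": [{"name": "test"}]}')

# Memory the master still accounts for...
accounted = snapshot["master/mem_used"]

# ...versus memory attributable to registered frameworks (none remain).
registered = sum(
    task.get("resources", {}).get("mem", 0)
    for framework in state["frameworks"]
    for task in framework.get("tasks", [])
)

print(accounted, registered, len(state["orphan_tasks"]))  # 128 0 1
```

The 128 MB charged to `master/mem_used` belongs to a task no registered framework owns, which is why the Web UI shows no usage.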

      Cause)
      When a framework registers with the master, it provides a failover_timeout, which tells the master how long to wait for a disconnected framework to re-register. If the framework disconnects and does not reconnect within this failover_timeout, the master will kill all tasks belonging to the framework.
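The timeout behavior just described can be sketched as a toy simulation (the `Master` class and its fields are illustrative, not the actual Mesos implementation):

```python
class Master:
    """Toy model of the master's failover_timeout handling (not real Mesos)."""

    def __init__(self):
        self.timeouts = {}      # framework_id -> failover_timeout (seconds)
        self.tasks = {}         # framework_id -> list of task names
        self.disconnected = {}  # framework_id -> time of disconnect

    def register(self, fid, failover_timeout):
        self.timeouts[fid] = failover_timeout
        self.tasks.setdefault(fid, [])

    def disconnect(self, fid, now):
        # Start the failover clock; tasks keep running for now.
        self.disconnected[fid] = now

    def tick(self, now):
        # Kill all tasks of any framework whose failover_timeout expired.
        for fid, t0 in list(self.disconnected.items()):
            if now - t0 >= self.timeouts[fid]:
                self.tasks[fid] = []  # tasks are killed
                del self.disconnected[fid]

master = Master()
master.register("test-framework", failover_timeout=60)
master.tasks["test-framework"].append("sleep 1000")

master.disconnect("test-framework", now=0)
master.tick(now=30)  # within the timeout: task survives
master.tick(now=61)  # timeout expired: task is killed
print(master.tasks["test-framework"])  # []
```

Note that the timer state here lives only in memory, which is exactly what the bug below exploits.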

      However, the master does not persist this failover_timeout across master failover. The master will therefore "forget" about a framework if both of the following happen:
      1) The master dies before the failover_timeout expires.
      2) The framework dies while the master is dead.

      When the master comes back up, the agent re-registers and reports the orphaned task(s). Because the master failed over, it cannot tell that these tasks are orphans (i.e. it assumes the frameworks might still re-register).
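The failed-over master's dilemma can be sketched as follows (a toy model under assumed names, not the actual Mesos recovery code): the new master's in-memory framework state is empty, so tasks reported by a re-registering agent cannot be tied to any failover_timeout.

```python
# Toy model of a master that has just failed over: the in-memory
# framework map is empty because failover_timeout was never persisted.
failed_over_master = {
    "frameworks": {},    # lost across failover
    "orphan_tasks": [],  # tasks the master cannot attribute
}

def agent_reregister(master, reported_tasks):
    """Agent re-registration: report still-running tasks to the master."""
    for task in reported_tasks:
        fid = task["framework_id"]
        if fid not in master["frameworks"]:
            # Unknown framework: the master assumes it may still
            # re-register, so the task is kept as an orphan and no
            # failover timer is (or can be) started.
            master["orphan_tasks"].append(task)

agent_reregister(failed_over_master,
                 [{"name": "test", "framework_id": "fw-1"}])
print(failed_over_master["orphan_tasks"])  # the task lingers indefinitely
```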

      Proposed solution)
      The master should save the FrameworkID and failover_timeout in the registry. Upon recovery, the master should resume the failover_timeout timers.
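A sketch of how the proposed fix could look (an assumption about the shape of the change, not actual Mesos registry code): write each framework's failover_timeout through to the registry, and on recovery restart a timer for every persisted framework that has not re-registered.

```python
registry = {}  # persisted store: framework_id -> failover_timeout

def register_framework(fid, failover_timeout):
    # Persist the timeout alongside the FrameworkID at registration time.
    registry[fid] = failover_timeout

def recover(reregistered, now):
    """On master recovery, resume failover timers for every persisted
    framework that has not re-registered yet."""
    timers = {}
    for fid, timeout in registry.items():
        if fid not in reregistered:
            timers[fid] = now + timeout  # deadline to kill its tasks
    return timers

register_framework("fw-1", failover_timeout=60)

# After failover: the framework did not come back, but the registry
# remembers it, so a timer can be resumed instead of orphaning tasks.
timers = recover(reregistered=set(), now=100)
print(timers)  # {'fw-1': 160}
```

With this in place, the scenario from the repro steps would end with the orphaned task being killed once the resumed timer fires, rather than lingering forever.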


People

  Assignee: Unassigned
  Reporter: Joseph Wu (kaysoky)
